Abstract

The estimation problem in a high-dimensional regression model with structured sparsity is
investigated. An algorithm using a two-step block thresholding procedure, called GR-LOL, is
provided. Convergence rates are produced: they depend on simple coherence-type indices of the
Gram matrix (easily checkable on the data) as well as on sparsity assumptions on the model
parameters, measured by a combination of $\ell_1$ within-block and $\ell_q$, $q < 1$,
between-block norms. The simplicity of the coherence indicator suggests ways to optimize the
rates of convergence when the group structure is not naturally given by the problem and is
unknown. In such a case, an auto-driven procedure is provided to determine the groups of
regressors (number and contents). An intensive practical study compares our grouping methods
with the standard LOL algorithm. We show that the grouping rarely deteriorates the results but
can improve them very significantly. GR-LOL is also compared with group-Lasso procedures and
exhibits a very encouraging behavior. The results are quite impressive, especially when the
GR-LOL algorithm is combined with a grouping pre-processing.

Keywords: Structured sparsity, Grouping, Learning Theory, Non Linear Methods, 
Block-thresholding, coherence, Wavelets 



1. Introduction 

In this paper, the following linear model is considered

$Y_i = X_{i\cdot}\,\beta + W_i, \quad i = 1, \dots, n \qquad (1)$

with a particular focus on cases where the number $k$ of regressors $X = (X_{\cdot 1}, \dots, X_{\cdot k})$
is large compared to the number $n$ of observations (although no such restriction is imposed).
$Y$ (respectively $W$) denotes the $n$-dimensional observation (respectively the error term).

We are interested in the estimation of the parameter $\beta$ and we consider the situation
where the expectation of the observation can be approximated by a sparse linear combination
of the available regressors. A natural method for sparse learning is $\ell_0$ regularization.
Since this optimization problem is generally NP-hard, approximate solutions are generally
proposed in practice. Standard approaches are $\ell_1$ regularization, such as Lasso (see for
instance Tibshirani (1996), Bickel et al. (2008) and Meinshausen and Yu (2009)) and Dantzig
(see Candes and Tao (2007)). Another commonly used approach is greedy algorithms, such as the
orthogonal matching pursuit (OMP) (see Tropp and Gilbert (2007)) or the iterative thresholding
algorithms (see Kerkyacharian et al. (2009)). In many practical applications, one often knows
a structure on the coefficient vector $\beta$ in addition to sparsity. For example, in group
sparsity, variables belonging to the same group may be assumed to be zero or nonzero
simultaneously. The idea of using group sparsity has been largely explored. For example, group
sparsity has been considered for simultaneous sparse approximation (see Wipf and Rao (2007))
and multi-task compressive sensing (see Ji et al. (2009)) to the tree sparsity (see He and
Carin (2009)). Numerous applications of these types of regularization schemes arise in the
context of multi-task learning and multiple kernel learning (see Bach (2008), Jenatton et al.
(2011)). To combine sparsity with grouping, Lasso has been extended to the group Lasso in the
statistical literature by Yuan and Lin (2006). Various combinations of norms allowing grouping
have been introduced, as in Zhao et al. (2009). Meier and Buhlmann (2008) study the logistic
regression model while Jacob et al. (2009) is concerned with the graph lasso. These grouping
strategies have been shown to improve the prediction performance and/or interpretability of
the learned models when the block structure is relevant (see Koltchinskii and Yuan (2010),
Huang et al. (2009), Lounici et al. (2011), Chiquet and Charbonnier (2011)). In Friedman and
Tibshirani (2010), Lasso and Group Lasso are combined in order to select groups and predictors
within a group.

In the sequel, we address the following program:

• GR-LOL algorithm. We investigate the theoretical performances of a blockwise two-step
thresholding algorithm. As LOL (the standard two-step thresholding algorithm, see Mougeot et
al. (2012)) is a counterpart of Lasso or Dantzig algorithms for ordinary sparsity, we introduce
here GR-LOL, based on the same precepts, combining an a priori knowledge of grouping. We
establish the rates of convergence of this new procedure when the parameter $\beta$ belongs to a
set of structured sparsity: the sparsity is measured by a combination of $\ell_q$-between-blocks
with $\ell_1$-within-blocks norms (see (12)). Although structured sparsity with overlapping groups
of variables constitutes an important source of practical examples (hierarchical structure for
instance), we focus in this paper on the non-overlapping case. To emphasize the practical
interest of the GR-LOL algorithm, we also explicitly show cases where not grouping induces
an accuracy loss compared to grouping.

• Grouping strategy. As explained in the examples above, in some applications,
the grouping of the predictors occurs quite naturally or is driven by some precise
requirements: hierarchical structures, multiple kernel learning, etc. However, in various cases
(for instance in genomics), there is no obvious grouping at hand. In such a setting, we provide
grouping strategies which, combined with the GR-LOL algorithm, aim at improving the rates
of convergence. These grouping strategies stem from the following observations. Concerning
the standard case (no grouping), although the two-step thresholding algorithms show
performances quite comparable with Lasso and Dantzig procedures at a much lower computational
cost, they theoretically require more stringent conditions on the matrix of predictors $X$,
namely coherence conditions instead of RIP-type conditions (see Mougeot et al. (2012)). In
the case of structured sparsity, this becomes surprisingly favorable, since the required
conditions (which are adaptations of the coherence conditions to the structured case) become
much more readable, and especially open opportunities for improvements with grouping strategies.
We are able to isolate simple quantities measured on the predictors $X$ yielding optimizing
strategies to select a structure on the predictors.

• Practical study. An intensive computation program is performed to show the advantages
and limitations of the GR-LOL procedure in several practical aspects as well as its
combination with different grouping strategies. Based on simulations, the benefits of grouping
the predictors are compared to the non-grouping case for prediction in sparse learning. We show
that the way of grouping the regressors may be critical, especially when there exists some
dependency between the regressors. Using simulated data, we observe that smart grouping
strategies strongly improve the prediction performances.

The paper is organized as follows. In Section 2, notations and general assumptions are
presented. Examples of grouping are highlighted. In Section 3, the procedure GR-LOL is
detailed. In Section 4, we state the theoretical results concerning the performances of GR-LOL.
In Section 5, we first detail explicit examples where grouping does improve the performances;
we then discuss strategies to 'boost' the rates of convergence. The practical performances of
GR-LOL are investigated in Section 6 and the proofs are detailed in Section 7.

2. Assumptions on the model and examples 

We first introduce some notation for the grouping of the predictors. Next, we state the
assumptions on the model: conditions on the noise, on the unknown parameters to be estimated
and on the predictors. We end this section with examples of models where specific groupings
are proposed.

In the sequel, for any subset $\mathcal I$ of $\{1, \dots, k\}$, $X_{\mathcal I}$ denotes the matrix of size
$n \times \#(\mathcal I)$ composed of the columns of $X$ whose indices are in $\mathcal I$. In the same way,
$u_{\mathcal I}$ is the restriction of the vector $u$ of $\mathbb R^k$ to the vector (of dimension $\#(\mathcal I)$) of
its coordinates with indices belonging to $\mathcal I$. Moreover, $\|u\|_{\mathcal I,1}$ and $\|u\|_{\mathcal I,2}$ denote
respectively the $\ell_1$-norm and $\ell_2$-norm of the restriction $u_{\mathcal I}$ of $u \in \mathbb R^k$:

$\|u\|_{\mathcal I,1} = \sum_{\ell \in \mathcal I} |u_\ell| \quad \text{and} \quad \|u\|^2_{\mathcal I,2} = \sum_{\ell \in \mathcal I} |u_\ell|^2 .$

2.1. Grouping 

We consider the model (1). We consider a partition $\mathcal G_1, \dots, \mathcal G_p$ of the set $\{1, \dots, k\}$
of the indices of the regressors. For any $j$ in $\{1, \dots, p\}$, $t_j = \#(\mathcal G_j)$ denotes the
cardinality of the group $\mathcal G_j$. We decide to subdivide the $k$ predictors into $p$ ($p \le k$)
groups of variables $X_{\mathcal G_1}, \dots, X_{\mathcal G_p}$, according to this partition. Following this
subdivision, for each $\ell = 1, \dots, k$, the predictor $X_{\cdot\ell}$ is now registered as $X_{\cdot(j,t)}$ where

- $j \in \{1, \dots, p\}$ is the index of the group to which the index $\ell$ belongs,

- $t = t_j(\ell) \in \{1, \dots, t_j\}$ is the rank of $\ell$ inside the group $\mathcal G_j$.

The notation $\ell = (j,t)$ is used throughout the paper. The group of indices $\mathcal G_j$ is then
identified with $\{(j,t),\ t = 1, \dots, t_j\}$. The index $t$ will sometimes in the sequel be
assimilated to a 'task index', in analogy to the forthcoming example 2.3.2.
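
The double indexing $\ell = (j,t)$ can be made concrete with a small lookup table. The following is only an illustrative sketch: the function name `index_partition` and the 0-based indexing are assumptions of this example, not notation from the paper.

```python
def index_partition(groups):
    """Map each regressor index l to its pair (j, t).

    groups: list of lists of column indices forming a partition of range(k).
    Indices are 0-based here, whereas the text uses 1-based indices.
    """
    pair_of = {}
    for j, members in enumerate(groups):
        for t, l in enumerate(members):
            if l in pair_of:
                raise ValueError(f"index {l} appears in two groups")
            pair_of[l] = (j, t)  # j: group index, t: rank inside the group
    return pair_of


# Hypothetical partition of k = 5 regressors into p = 3 groups.
example_pairs = index_partition([[0, 3], [1], [2, 4]])
```

With this partition, regressor 3 is the second member of the first group and regressor 4 is the second member of the third group.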

2.2. Assumptions 

2.2.1. Homogeneousness condition for the predictors 

To take into account the natural inhomogeneity of the data, we define a normalizing
constant $n_\ell$ depending on $\ell \in \{1, \dots, k\}$. It appears naturally as a 'normalizing constant'
through the forthcoming assumption (10). Dividing the $\ell$-th column of the design matrix of
model (1) by $\sqrt{n_\ell}$ for any observation $i = 1, \dots, n$, and still denoting by $X$ the rescaled
matrix, the model becomes

$Y = X\alpha + W \qquad (2)$

where

$\alpha_\ell = \sqrt{n_\ell}\,\beta_\ell$ for any $\ell = 1, \dots, k$.

In the sequel we assume that there exist a sequence $v_n$ and constants $0 < a < b$ such that
for any $\ell \in \{1, \dots, k\}$, we get

(A1): $a\,v_n \le n_\ell \le b\,v_n . \qquad (3)$
The quantity $v_n$ is important because it drives the rates of convergence of our algorithm.

2.2.2. Conditions on the predictors

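Denote $\Gamma = X^t X$ the Gram matrix of $X$ and $\Gamma_{\mathcal I} = X_{\mathcal I}^t X_{\mathcal I}$ the Gram matrix of $X_{\mathcal I}$.
Observe that $\Gamma_{\ell\ell'}$ is the scalar product between the two predictors $X_{\cdot\ell}$ and $X_{\cdot\ell'}$ and
define the coherence of the Gram matrix

$\gamma := \sup_{(\ell,\ell') \in \{1,\dots,k\}^2,\ \ell \neq \ell'} |\Gamma_{\ell\ell'}| \qquad (4)$

Recall that each $\ell$ in $\{1, \dots, k\}$ is registered as a pair of indices $(j,t)$ where $j$ is the
index of the group and $t$ the rank of $\ell$ inside the group. Then $\ell \neq \ell'$ means that

either $t \neq t'$: $\ell$ and $\ell'$ are not observed at the same 'task',
or $t = t'$ and $j \neq j'$: $\ell$ and $\ell'$ are observed at the same 'task' but in different groups.

We split $\gamma$ as $\gamma = \gamma_{BT} \vee \gamma_{BG}$ where

$\gamma_{BT} := \sup_{(j,j') \in \{1,\dots,p\}^2}\ \sup_{t \in \{1,\dots,t_j\},\ t' \in \{1,\dots,t_{j'}\},\ t \neq t'} |\Gamma_{(j,t)(j',t')}| \qquad (5)$

and

$\gamma_{BG} := \sup_{(j,j') \in \{1,\dots,p\}^2,\ j \neq j'}\ \sup_{t \in \{1,\dots,t_j \wedge t_{j'}\}} |\Gamma_{(j,t)(j',t)}| . \qquad (6)$

For any subset $\mathcal I$ of the set of indices $\{1, \dots, k\}$, let $\tau(\mathcal I)$ and $r(\mathcal I)$ be the following
indicators

$\tau(\mathcal I) := \#(\mathcal I)\,\gamma_{BT} + \#(\{j,\ \exists t,\ (j,t) \in \mathcal I\})\,\gamma_{BG} . \qquad (7)$

$r(\mathcal I) := \#(\mathcal I)\,\gamma^2_{BT} + \#(\{j,\ \exists t,\ (j,t) \in \mathcal I\})\,\gamma^2_{BG} . \qquad (8)$

In particular, for any $j \in \{1, \dots, p\}$, we define

$\tau_j = \tau(\mathcal G_j) = t_j\,\gamma_{BT} + \gamma_{BG}$ and $r_j = r(\mathcal G_j) = t_j\,\gamma^2_{BT} + \gamma^2_{BG}$

as well as

$\tau^* = \max_{j=1,\dots,p} \tau_j = t^*\gamma_{BT} + \gamma_{BG}$ and $r^* = \max_{j=1,\dots,p} r_j = t^*\gamma^2_{BT} + \gamma^2_{BG} \qquad (9)$

where $t^* = \max_{j=1,\dots,p}(t_j)$.

Let us state now the assumptions on the regressors $X$. First, we assume that the columns
of the matrix $X$ are normalized:

(A2): $\Gamma_{\ell\ell} = 1$ for any $\ell = 1, \dots, k . \qquad (10)$

Second, we assume that

(A2'): $\tau^* \le \nu . \qquad (11)$

for some $\nu$ given in $]0, 1[$. Observe that under (A2), we obviously have $r^* \le \tau^*$.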
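
Since the indices $\gamma_{BT}$, $\gamma_{BG}$, $\tau^*$ and $r^*$ are meant to be checkable on the data, it may help to see them computed. The sketch below is an illustration under stated assumptions: the function name `coherence_indices`, the 0-based indices and the representation of the partition as a list of lists are choices made for this example only.

```python
import numpy as np


def coherence_indices(X, groups):
    """Return (gamma_BT, gamma_BG, tau_star, r_star) following (5), (6) and (9).

    X      : (n, k) design matrix, columns assumed normalized as in (A2)
    groups : list of lists of 0-based column indices, a partition of range(k)
    """
    A = np.abs(X.T @ X)                      # |Gamma|
    k = A.shape[0]
    j_of = np.empty(k, dtype=int)            # group index j of each column
    t_of = np.empty(k, dtype=int)            # rank t of each column in its group
    for j, members in enumerate(groups):
        for t, l in enumerate(members):
            j_of[l], t_of[l] = j, t
    diff_task = t_of[:, None] != t_of[None, :]
    same_task_other_group = (~diff_task) & (j_of[:, None] != j_of[None, :])
    gamma_bt = A[diff_task].max() if diff_task.any() else 0.0
    gamma_bg = A[same_task_other_group].max() if same_task_other_group.any() else 0.0
    t_star = max(len(g) for g in groups)
    tau_star = t_star * gamma_bt + gamma_bg
    r_star = t_star * gamma_bt ** 2 + gamma_bg ** 2
    return gamma_bt, gamma_bg, tau_star, r_star


# Toy design: columns 0 and 1 are correlated, columns 2 and 3 are orthogonal to all.
s = 2 ** -0.5
X_toy = np.array([[1, s, 0, 0], [0, s, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
same_group = coherence_indices(X_toy, [[0, 1], [2, 3]])    # correlated pair in one group
split_groups = coherence_indices(X_toy, [[0, 2], [1, 3]])  # correlated pair shares a task
```

On this toy design the same correlation is counted in $\gamma_{BT}$ under the first grouping and in $\gamma_{BG}$ under the second, which changes $\tau^*$ by a factor $t^* = 2$.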



2.2.3. Conditions on the unknown regression parameters 
Assume that there exist $q \le 1$ and $M, M' > 0$ such that

(A3): $\sum_{j=1}^{p} \|\beta\|^q_{\mathcal G_j,1} \le (M')^q$ or equivalently $\sum_{j=1}^{p} \|\alpha\|^q_{\mathcal G_j,1} \le M^q\, v_n^{q/2} . \qquad (12)$

2.2.4. Conditions on the noise
Finally, we assume

(A4): $W$ is a vector of i.i.d. variables $\mathcal N(0, \sigma^2)$.

Notice that the Gaussian distribution assumption may be replaced without modifications by
a sub-Gaussian distribution with zero mean and variance $\sigma^2$.

2.3. Specific models. Examples. 
2.3.1. No-group case 

One specific case of our modeling is when $\mathcal G_j = \{j\}$ for any $j \in \{1, \dots, k\}$: the no-group
setting, which corresponds to $p = k$. Here, the predictors are generally normalized by the
number of observations, $n_\ell = n$, and the homogeneousness condition (3) is trivially satisfied
for $v_n = n$. Moreover, $\gamma_{BT} = 0$ and

$\gamma_{BG} = \sup_{\ell \neq \ell'} |\Gamma_{\ell\ell'}|$

is the coherence of the matrix $\Gamma$. We get $\tau^* = \gamma_{BG}$ and Condition (11) becomes $\gamma_{BG} \le \nu$.
Note that a similar condition is used in Kerkyacharian et al. (2009) or Mougeot et al. (2012).

The regularity conditions in this case reduce to an $\ell_q$ condition on the parameter vector $\beta$.
2.3.2. Multi-task case 

An interesting case where many conditions find a direct interpretation is the multi-task
regression model defined by the stack of $T$ linear models:

$Y_1 = X_1\beta_1 + W_1$
$Y_2 = X_2\beta_2 + W_2$
$\vdots$
$Y_T = X_T\beta_T + W_T \qquad (13)$

Here $X_1, \dots, X_T$ are $n_0 \times p$ design matrices and $W_1, \dots, W_T$ are (independent) error terms.
This modeling is used (for instance) to introduce a time variation: the target variable $Y$ and
the predictors $X_1, \dots, X_p$ are observed on $T$ different periods of time. We prefer the term task
to time so as not to induce confusion with the 'observation times' $i$. For each task the
observation consists of a vector $Y_t$ of size $n_0$, analyzed on the matrix of predictors $X_t$.
Model (13) can be globally rewritten as Model (1) with

$n = n_0 T$ and $k = pT$,

the design matrix $X$ being block diagonal with blocks $X_1, \dots, X_T$ and

$\beta = (\beta_1^t, \dots, \beta_T^t)^t$ and $Y = (Y_1^t, \dots, Y_T^t)^t$.

We obviously have $n_\ell = n_0$ for any $\ell = 1, \dots, k$ and the normalization condition (3) is
trivially satisfied for $v_n = n_0$. Notice that the different groups of indices, $\mathcal G_j = \{(j,t),\ t =
1, \dots, T\}$ for $j = 1, \dots, p$, all have the same size $T$; the index $j$ points out the predictor
$X_j$ for $j = 1, \dots, p$ and the index $t$ is an indicator of the task of observation for
$t = 1, \dots, T$. Thanks to the block structure of the matrix $X$, the predictors are obviously
orthogonal as soon as the tasks are different; even the same variables observed at different
tasks are orthogonal. We deduce that $\gamma_{BT} = 0$. Moreover, denoting $\Gamma_1, \dots, \Gamma_T$ the sequence of
Gram matrices associated to the $T$ models given in (13),

$\gamma_{BG} = \max_{t=1,\dots,T}\ \sup_{j \neq j'} |(\Gamma_t)_{jj'}|$

and Condition (11) becomes $\gamma_{BG} \le \nu$.

This example is especially emblematic. In this context, the rank $t$ in the group $\mathcal G_j$ is easily
interpretable as a task. As well, condition (A3) is quite realistic since the coefficients
$\beta_{(j,t)}$ on the predictor $X_{\cdot(j,t)}$ can be assumed to vary slowly with the task. Furthermore, the
separation introduced in subsection 2.2.2 between $\gamma_{BT}$ and $\gamma_{BG}$, which implicitly assumes in
condition (A2') that $\gamma_{BT}$ is a smaller quantity, naturally finds its interpretation here (since
it is 0).
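
The rewriting of the $T$ task models as a single block-diagonal model can be sketched as follows. This is a minimal illustration: the function name `stack_multitask`, the task-major column ordering (column $tp + j$ holds predictor $j$ at task $t$) and the 0-based indices are assumptions of this example.

```python
import numpy as np


def stack_multitask(blocks):
    """Build the block-diagonal design and the groups G_j = {(j, t), t = 1..T}.

    blocks: list of T arrays of shape (n0, p), one design matrix per task.
    Column t * p + j of the result carries predictor j observed at task t.
    """
    T = len(blocks)
    n0, p = blocks[0].shape
    X = np.zeros((T * n0, T * p))
    for t, Xt in enumerate(blocks):
        X[t * n0:(t + 1) * n0, t * p:(t + 1) * p] = Xt
    groups = [[t * p + j for t in range(T)] for j in range(p)]
    return X, groups


# Hypothetical example with T = 2 tasks, n0 = 5 observations, p = 3 predictors.
rng = np.random.default_rng(0)
X_mt, groups_mt = stack_multitask([rng.standard_normal((5, 3)) for _ in range(2)])
cross_task_gram = (X_mt.T @ X_mt)[:3, 3:]   # scalar products across the two tasks
```

All cross-task scalar products vanish by construction, which is the reason why $\gamma_{BT} = 0$ in this model.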

3. GR-LOL : Grouping Research for Leaders 

Let us now describe the steps of our procedure. Once and for all, we fix the constant $\nu$, which
is a quantity linked to the precision of the procedure; take for instance $\nu = 1/2$.

Compute a bound for the number of leaders. Form $\Gamma = X^t X$ and compute $\gamma_{BT}, \gamma_{BG}$
as defined in (5) and (6). Deduce $\tau^* = t^*\gamma_{BT} + \gamma_{BG}$ and $N^* = \nu\,(\tau^*)^{-1}$ (see Definition
Search for the leaders. Form $R_\ell = \sum_{i=1}^{n} X_{i\ell} Y_i$ (rewritten as $R_{(j,t)}$ to take into account
the group number) and compute, for any group $\mathcal G_j$, $j = 1, \dots, p$, the quantity

$\rho_j^2 = \sum_{t=1,\dots,t_j} R^2_{(j,t)} := \|R\|^2_{\mathcal G_j,2} .$

$\rho_j$ is an indicator of the performance of the predictors whose indices are in the group $\mathcal G_j$
in explaining the target variable $Y$. Next, we consider the groups for which this indicator
is high. More precisely, the sequence $\rho_j$ is sorted:

$\rho^2_{(1)} \ge \dots \ge \rho^2_{(j)} \ge \dots \ge \rho^2_{(p)}$

and the group-leaders are the groups of predictors with group-indices $j \in \mathcal B$ where

$\mathcal B = \{j = 1, \dots, p,\ \rho_j^2 \ge (\rho^2_{(N^*)} \vee \lambda_n(1)^2)\} \qquad (14)$

where $\lambda_n(1)$ is a first tuning parameter. Denote $\mathcal G_{\mathcal B} = \cup_{j \in \mathcal B}\,\mathcal G_j$. Notice here that in the
case where $\lambda_n(1)^2 > \rho^2_{(1)}$, the leader indices set $\mathcal B$ is empty and our final estimate for
$\beta$ is zero.

Observe also that $\#(\mathcal B) \le N^*$ and $\#(\mathcal G_{\mathcal B}) \le t^* N^*$, implying that

$\tau(\mathcal G_{\mathcal B}) \le N^*(t^*\gamma_{BT} + \gamma_{BG}) = \nu .$

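Regress on the leaders. We now perform the OLS on the block-leaders:

$\hat\beta(\mathcal G_{\mathcal B}) = \mathrm{Argmin}_u \|Y - X_{\mathcal G_{\mathcal B}} u\|^2 = [X_{\mathcal G_{\mathcal B}}^t X_{\mathcal G_{\mathcal B}}]^{-1} X_{\mathcal G_{\mathcal B}}^t Y .$

We then obtain the preliminary estimate $\hat\beta$ defined by

$\hat\beta_{\mathcal G_{\mathcal B}} = \hat\beta(\mathcal G_{\mathcal B})$ and $\hat\beta_{\mathcal G_{\mathcal B}^c} = 0 .$

Block thresholding. We apply the second thresholding on the resulting estimated coefficients:

$\forall \ell = (j,t) \in \{1, \dots, k\}, \quad \hat\beta^*_\ell = \hat\beta_\ell\ \mathbb 1\{\|\hat\beta\|_{\mathcal G_j,2} \ge \lambda_n(2)/\sqrt{v_n}\}$

where $\lambda_n(2)$ is the second tuning parameter.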
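
The four steps above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: the function name `gr_lol`, the 0-based group representation, the choice to pass $\tau^*$ as an argument, the floor applied to $N^*$ (forced to be at least 1) and the convention that `lam2` is applied directly on the scale of the fitted coefficients are all assumptions of this example.

```python
import numpy as np


def gr_lol(X, Y, groups, lam1, lam2, tau_star, nu=0.5):
    """Illustrative two-step block thresholding in the spirit of GR-LOL.

    X        : (n, k) design with unit-norm columns (assumption (A2))
    groups   : list of lists of 0-based column indices, a partition of range(k)
    lam1     : first threshold, compared with the group scores rho_j
    lam2     : second threshold, applied to the l2 norm of each fitted block
    tau_star : t* gamma_BT + gamma_BG, computed beforehand from the Gram matrix
    """
    k = X.shape[1]
    p = len(groups)
    # Step 1: bound on the number of leader groups, N* = nu / tau*.
    if tau_star > 0:
        n_star = int(min(p, max(1, np.floor(nu / tau_star))))
    else:
        n_star = p
    # Step 2: group scores rho_j^2 = sum_t R_(j,t)^2 with R = X^t Y.
    R = X.T @ Y
    rho2 = np.array([np.sum(R[g] ** 2) for g in groups])
    cutoff = max(np.sort(rho2)[::-1][n_star - 1], lam1 ** 2)
    leaders = [j for j in range(p) if rho2[j] >= cutoff]
    beta = np.zeros(k)
    if not leaders:
        return beta  # no leader: the estimate is zero
    # Step 3: ordinary least squares restricted to the leader columns.
    idx = np.concatenate([np.asarray(groups[j]) for j in leaders])
    coef, *_ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
    beta[idx] = coef
    # Step 4: block thresholding of the fitted coefficients.
    for j in leaders:
        g = groups[j]
        if np.linalg.norm(beta[g]) < lam2:
            beta[g] = 0.0
    return beta


# Toy usage on an orthonormal design with three groups of size two.
X_demo = np.eye(6)
Y_demo = np.array([3.0, 3.0, 0.0, 0.0, 0.1, 0.1])
groups_demo = [[0, 1], [2, 3], [4, 5]]
beta_demo = gr_lol(X_demo, Y_demo, groups_demo, lam1=1.0, lam2=1.0, tau_star=0.25)
```

In this toy case only the first group passes the first threshold, so the estimate keeps the two large coefficients and sets the remaining ones to zero.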
4. Results 

In this section, we provide a result on the convergence rate of the GR-LOL procedure for
the quadratic error on the estimation of the $\beta$ coefficients of the regression model when the
input parameters $\lambda_n(1), \lambda_n(2)$ are properly chosen.

4.1. Rates of convergence

The proof of the following theorem is given in Section 7.

Theorem 1. Fix $\nu$ in $(0, 1)$ and assume that (A1), (A2), (A2'), (A3) and (A4) are satisfied.
Put

$\lambda^* = \left(M^2 v_n r^* \vee (t^* \vee \log p)(1 + \sigma^2)\right)^{1/2} . \qquad (15)$

Choose the thresholding levels $\lambda_n(1), \lambda_n(2)$ such that

$\lambda_n(2) = c_2\,\lambda^*$ and $\lambda_n(1) = c_1\,\lambda^* \vee (2M v_n^{1/2}\,\tau^*/\nu)$

for

$c_1 \ge c_2, \quad c_2 \ge 5\sqrt{\kappa}, \quad c_1 \ge (4 + \nu^{-1/q})$

and

$\kappa = (1-\nu)^{-1} \vee 4(1-\nu)^{-2} \vee 2(2\nu^2 - \nu + 3)(1-\nu)^{-3} .$

There exists a positive constant $C$ (depending on $c_1, c_2, \nu$ and $M$) such that

$\mathbb E\|\hat\beta^* - \beta\|_2^2 \le C\,(\lambda^*/\sqrt{v_n})^{2-q}$

as soon as 

p < c- 1 v^ 2 (A*)"" exp (c b (A*) 2 (l A (rT 1 )) 

where 

c. = M-(^V3k) ^c^^L.A^. 

4.2. Comments

It is worthwhile to notice that Theorem 1 rather clearly identifies the key features needed
for our procedure to be sharp. Basically, it depends on the structured sparsity as well
as on the size of the groups and the correlation structure within tasks and groups.
Structured sparsity Concerning the structured sparsity of the coefficients, condition (12)
reflects overall a homogeneity inside the groups as well as a small number of
'significant' groups. As illustrated in Section 5.1, the algorithm has better rates if
the large coefficients are gathered in the same groups, instead of being scattered in
different groups.

Size and correlation inside the groups A key quantity is $\tau^* = t^*\gamma_{BT} + \gamma_{BG}$. In
particular, this quantity gives some clear indication on how to optimize the procedure
when the structure is not a priori given by the problem. This is detailed in the following
section.



4.3. A specific example: No-group case

In the no-group case, the performances stated in the previous theorem are similar to
those achieved by the standard LOL procedure studied in Mougeot et al. (2012). Actually,
in the no-group case, $p = k$; recall that $v_n = n$, $n_\ell = n$, $t^* = 1$, $\gamma_{BT} = 0$ and that
$\tau^* = \gamma_{BG}$ is the coherence of the matrix $\Gamma$. Observe also that in this case $r^* = (\tau^*)^2$.
Condition (A3) (see (12)) is here the usual $\ell_q$ condition. Applying Theorem 1, we choose

$\lambda_n(1) = c_1\,(n^{1/2}\gamma_{BG} \vee \sqrt{\log k})$ and $\lambda_n(2) = c_2\,(n^{1/2}\gamma_{BG} \vee \sqrt{\log k})$

for constants $0 < c_2 < c_1$ large enough and we get

$\mathbb E\|\hat\beta^* - \beta\|_2^2 \le c\,\left(\gamma_{BG} \vee \sqrt{\frac{\log k}{n}}\right)^{2-q}$

under the condition

$k \le c\,\left(n^{q/2}\,(n\gamma^2_{BG} \vee \log k)^{-q/2}\right) \exp\left(c\,(n\gamma^2_{BG} \vee \log k)\right)$

which can be written as a lower bound for the constants above when $n\gamma^2_{BG} \sim \log k$. There is
no limitation on $k$ except $\log k/n \le C$. In this case, the rate is minimax.

4.4. A more interesting example: Multi-task case

In the multi-task case, we observe $n_0$ observations issued from $p$ variables on $T$ task
units. We have

$v_n = n_0, \quad n_\ell = n_0, \quad t^* = T, \quad \gamma_{BT} = 0$

and $\tau^* = \gamma_{BG}$ is the maximum of the coherences associated to the different Gram sub-matrices
$\Gamma_1, \dots, \Gamma_T$. As previously, we get $r^* = (\tau^*)^2$. Choosing

$\lambda_n(1) = c_1\,(n_0^{1/2}\gamma_{BG} \vee \sqrt{T} \vee \sqrt{\log p})$ and $\lambda_n(2) = c_2\,(n_0^{1/2}\gamma_{BG} \vee \sqrt{T} \vee \sqrt{\log p})$

for constants $0 < c_2 < c_1$ large enough, we get

$\mathbb E\|\hat\beta^* - \beta\|_2^2 \le c\,\left(\gamma^2_{BG} \vee \frac{T}{n_0} \vee \frac{\log p}{n_0}\right)^{1-q/2}$

under the condition

$p \le c\,\left(n_0^{q/2}\,(n_0\gamma^2_{BG} \vee T \vee \log p)^{-q/2}\right) \exp\left(c\,(n_0\gamma^2_{BG} \vee T \vee \log p)\right)$

yielding a lower bound for the constants here above when $n_0\gamma^2_{BG} \sim T \sim \log p$. Observe that
there is no limitation on $p$.

4.5. 'Minimaxity', comparisons

In this part, we use Theorem 1 to evaluate the quality of our procedure in various cases.
Minimax, no group In the no-group case, minimax bounds are known (see Raskutti et al.
(2011)) and our procedure achieves these bounds as soon as $\gamma_{BG} = \tau^* \le O(\sqrt{\log k/n})$.
Still minimax when grouping For any $q \le 1$, we obviously have

$\sum_{j=1}^{p} \|\beta\|^q_{\mathcal G_j,1} \le \sum_{\ell=1}^{k} |\beta_\ell|^q .$

Hence, as soon as $\tau^* \le O(\sqrt{\log k/n})$, which is satisfied for instance if

$\gamma_{BT} = 0, \quad \gamma_{BG} \le O(\sqrt{\log k/n})$ and $t^* \le \log k$,

the GR-LOL procedure is still minimax, using again the lower bound given in Raskutti et al.
(2011).

Wavelet coefficients Let us consider the standard case of the signal model where $k \le n$
and where the $\beta$'s are the wavelet coefficients of the unknown signal. Observe that
the condition $\|\beta\|_q \le M$ for $q \le 1$ corresponds to the signal belonging to a ball of
a Besov space of the form $B^{s}_{q,q}$. Hence Theorem 1 proves that GR-LOL is minimax for any
grouping strategy such that

$\tau^* \le O(\sqrt{\log k/n})$ and $t^* \le O(\log k)$.

This is an extension of the block thresholding strategies which are generally performed
with blocks chosen inside each multiresolution level (see for instance, among many
others, Hall et al. (1998), Cai and Zhou (2009)).

Comparison with other structured sparsity conditions Our conditions involving simple
correlation quantities on the regressors are quite difficult to compare with more involved
conditions of geometric nature, as in Lounici et al. (2011), or of structured sparse
coding nature, as in Huang et al. (2009) for instance. Let us just mention that these
conditions are very likely to be stronger than other ones, as is the case in the no-group
case compared to RIP conditions. However they have the advantage of being
checkable on the data and they are readable enough to give directions to optimize the
procedure. This point is developed in the sequel, providing an algorithm to determine
the groups.

5. Boosting the rates using grouping



Generally in structured sparsity frameworks, the grouping comes from the data, as is
the case for instance in the multi-task case. However, in many situations there is no indication
of such a 'natural' grouping. Our purpose is to explain how to proceed for boosting the rates
using grouping. We investigate different ideas for grouping strategies and in subsection 5.4
a new (auto-driven) grouping procedure called "Boosting Rates Gathering" is provided. To
better introduce the BRG algorithm, we first detail an example explaining what gain can be
expected from a suitable grouping and to what extent.

To simplify (but with obvious generalization), we assume that $v_n = n$ in this section.

5.1. Grouping versus non-grouping 

Consider a model such that the Gram matrix $\Gamma$ is such that $\gamma \ge \sqrt{(\log k)/n}$ (which is
the standard case).

• Use GR-LOL. First, assume that the grouping is such that $\gamma_{BT} \le \gamma/t^*$ and $\gamma_{BG} \le \gamma$.
Assume in addition that $t^* \le c\,[\gamma^{-1} \vee n\gamma^2]$ for some positive constant $c < 1$. We see below
that these conditions can automatically be ensured by the following BRG algorithms.

Consider the case where

$\beta_\ell = \gamma$ if $\ell \in \mathcal G_1 \cup \dots \cup \mathcal G_{\lfloor(\gamma t^*)^{-q}\rfloor}$, and $\beta_\ell = 0$ else.

Condition (A3) is then fulfilled with $M = 1$. Then, applying Proposition (17), the prediction
error is bounded by $C\gamma^{2-q}$.

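• Use LOL. Second, we use LOL (corresponding to GR-LOL in the no-group case) and
we denote $\hat\beta^{(2)}$ the estimate obtained using this second algorithm. Since $n_\ell = n$ here, so that
the design is rescaled by $n^{-1/2}$, recall that $\alpha_\ell = n^{1/2}\beta_\ell$ with

$\beta_\ell = \gamma$ if $\ell \in \mathcal G_1 \cup \dots \cup \mathcal G_{\lfloor(\gamma t^*)^{-q}\rfloor}$, and $\beta_\ell = 0$ else.

So we have $\#\{\ell,\ \beta_\ell \neq 0\} \le t^*\lfloor(\gamma t^*)^{-q}\rfloor$. Note that

$\sum_{j \le \lfloor(\gamma t^*)^{-q}\rfloor} (t^*\gamma)^q \le (\gamma t^*)^{-q}(t^*\gamma)^q = 1 .$

Moreover, under (A2),

$R_\ell = \alpha_\ell + \sum_{\ell'=1,\dots,k,\ \ell' \neq \ell,\ \beta_{\ell'} \neq 0} \Gamma_{\ell\ell'}\,\alpha_{\ell'} + \xi_\ell .$

We deduce

$R_\ell = a_\ell + b_\ell + \xi_\ell$

where

$|a_\ell| \le n^{1/2}\gamma, \quad |b_\ell| \le n^{1/2}\gamma^2\,(t^*\lfloor(\gamma t^*)^{-q}\rfloor) \le 2n^{1/2}\gamma$

and $\xi_\ell$ is distributed as a centered Gaussian distribution of variance 1. Choose now in Theorem
1 $\lambda_n(1) \ge 5n^{1/2}\gamma$ (this choice is compatible with the assumptions in there). Then we get,
for any index $\ell$ associated with a nonzero coefficient $\beta_\ell$,

$P(|R_\ell| \le \lambda_n(1)) \ge 1 - P(|R_\ell - \mathbb E R_\ell| > 2n^{1/2}\gamma) \ge 1 - \exp(-2n\gamma^2)$

which can be bounded below by 0.5 for $\sqrt{n}\,\gamma$ larger than an absolute constant. Since

$|\hat\beta^{(2)}_\ell - \beta_\ell| \ge |\hat\beta^{(2)}_\ell - \beta_\ell|\ \mathbb 1\{\ell \notin \mathcal B\} = |\beta_\ell|\ \mathbb 1\{\ell \notin \mathcal B\} \ge \gamma\ \mathbb 1\{|R_\ell| < \lambda_n(1)\},$

we deduce

$\mathbb E\|\hat\beta^{(2)} - \beta\|^2 \ge 0.5\,\gamma^2\,(t^*\lfloor(\gamma t^*)^{-q}\rfloor)$

and the prediction error is always larger than $0.5\,(t^*)^{1-q}\gamma^{2-q}$. So the prediction using
grouping gives an average error smaller by a factor of $(t^*)^{1-q}$, which can rapidly become
substantially large when $t^*$ itself grows.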

Observe also that the first procedure takes benefit of the fact that the 'big' (here the
nonzero) $\beta$'s are 'gathered' in the same groups. If instead we have a configuration with the
same final number of $\beta$'s, all equal to $\gamma$, but all scattered in different groups, then condition
(A3) is no longer satisfied and the group procedure achieves a lower rate. Actually a closer
look at the proofs shows that the rate is the same as the one obtained by the LOL procedure.

5.2. Gathering 

A natural idea coming from the example above is to 'gather' in the same group the indices
$\ell$ with $R_\ell$ substantially big or of the same size. This obviously helps to decrease the number
of groups, which is an important issue. Natural ways to proceed are the gathering procedures
below.

- (GGa) Gathered Grouping with absolute correlation: this procedure gathers, in each
group, variables exhibiting similar absolute values $|R_\ell|$ of the correlation coefficients with
the target $Y$. The $p$ different groups are then successively filled by using the ordered
indices:

$\mathcal G_1 = \{(1), \dots, (\lfloor k/p\rfloor)\},\ \dots,\ \mathcal G_p = \{(k - \lfloor k/p\rfloor), \dots, (k)\}$

where $(i)$ denotes the index associated to the ranked quantity $|R_{(i)}|$.

- (GGc) Gathered Grouping with correlation: it is the same procedure as (GGa) but
using the $R_\ell$'s instead of the absolute values $|R_\ell|$'s.

In order to explore in practice the benefit of these grouping strategies (see the next section),
we also introduce

- (GGr) Random Gathered Grouping: this procedure gathers, in each group, $k/p$ variables
randomly chosen among the $k$ regressors.
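
The three gathering rules can be sketched as one function. This is an illustration only: the function name `gathered_groups`, the `mode` labels and the use of `np.array_split` (which produces nearly equal consecutive chunks rather than the exact $\lfloor k/p \rfloor$ blocks written above) are assumptions of this example.

```python
import numpy as np


def gathered_groups(R, p, mode="abs", seed=0):
    """Split the k indices into p consecutive chunks of a ranking of R.

    mode = "abs"    : rank by decreasing |R_l|   (GGa-like)
    mode = "signed" : rank by decreasing R_l     (GGc-like)
    mode = "random" : random order               (GGr-like)
    """
    R = np.asarray(R, dtype=float)
    if mode == "abs":
        order = np.argsort(-np.abs(R), kind="stable")
    elif mode == "signed":
        order = np.argsort(-R, kind="stable")
    elif mode == "random":
        order = np.random.default_rng(seed).permutation(len(R))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [[int(l) for l in chunk] for chunk in np.array_split(order, p)]


# Hypothetical correlations of k = 4 regressors with the target.
R_demo = [0.1, -3.0, 2.0, -0.5]
gga_demo = gathered_groups(R_demo, p=2, mode="abs")
ggc_demo = gathered_groups(R_demo, p=2, mode="signed")
ggr_demo = gathered_groups(R_demo, p=2, mode="random")
```

The absolute ranking puts the two strongest regressors (indices 1 and 2) in the same group, whereas the signed ranking separates them because their correlations have opposite signs.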

5.3. Taking into account the coherence and the size t* 

If we look at the convergence results of Theorem 1 with a view to boosting the rates, we observe
that not only the structured sparsity is important but also that the following quantity has
to be optimized

(16)

Looking at this quantity gives some indications for choosing a procedure. First, $t^*$ has to be
smaller than $\log p$ if possible. This obviously leads to choosing balanced groups. Looking
now at the quantity $\tau^* = t^*\gamma_{BT} + \gamma_{BG}$ indicates that the rates would benefit from choosing
groups in such a way that $\gamma_{BT}$ is as small as possible. As a consequence $\gamma_{BG}$ is equal to
the maximal correlation $\gamma$. This observation gives rise to the following strategy. Divide the
columns of $X$ into two sets: $S_1$ of the items which are highly correlated, $S_2$ for the remaining,
weakly correlated ones. Put $S_1$ all in 'Task' number 1: we ensure then that $\gamma_{BT}$ is less than
the maximal correlation within $S_2$ while $\gamma_{BG} = \gamma_{\max} = \gamma$. Another way to describe this is that
each column of $S_1$ is the first point of a new group. This induces in the sequel the name of
'delegate'.

It now remains to answer two questions: how to choose the number of groups (the cardinality
of $S_1$) and how to fill up the groups after the choice of their delegates. The answers to these
questions are obtained by balancing the quantities in (16), and then using the gathering
principle. A final remark is that the quantity $\gamma$ is generally a leading term. Let us now be
more precise and describe the BRG (Boosting Rates Gathering) procedure.

5.4. BRG (Boosting Rates Gathering)

5.4.1. Determination of the number p* of groups

This is the first step of the BRG procedure. Since we choose to have balanced groups,
it is equivalent to determine the number of groups $p$ or the average size $t^* = k/p$ of the
groups. Let us consider the following curves $y = g(u)$ and $y = p(u)$ defined for $u$ in $[1, \infty[$,

$g(u) = k/u$ and $p(u) = \#\{\ell \in \{1, \dots, k\},\ \exists\,\ell' \in \{\ell+1, \dots, k\}$ such that $|\Gamma_{\ell\ell'}| \ge \gamma/u\}.$

These curves intersect at a point $u_1$ as illustrated in Figure 1. Observe that $p(u)$ represents
the cardinality of the set $S_1(u)$ of correlated columns with correlation higher than $\gamma/u$ (and
so parameterized by $u$), with associated characteristics $t^*(u)$, $\gamma_{BT}(u) = \gamma/u$, $\gamma_{BG} = \gamma$. We
are looking for $u$ such that

$t^*(u)\,\gamma_{BT}(u) \le \gamma_{BG} \iff t^*(u)\,\gamma/u \le \gamma \iff u \ge u_1,$

since $t^*(u) = k/p(u)$. Let us now draw the curve $p(u)\log p(u)$ and find the point $u_2$
verifying

$u_2 = \inf\{u > 0,\ p(u)\log p(u) \ge k\}.$

Deciding that the number of groups is

$p^* = \lfloor u_1 \vee u_2 \rfloor,$

we are sure that the leading quantity in (16) is $\gamma$, at least as soon as $\gamma \ge c\sqrt{\log p/n}$, which
is the standard case in high dimension.

Figure 1: X-axis: common size $t^*$. Y-axis: number $p$ of groups. Solid line: $g(u) = k/u$. Dashed
line: $p(u)$ for $\rho = 0.5$, $\pi = 20\%$ (see simulation part). Dot-dashed line: $p(u)\log p(u)$. Dotted
lines: corresponding $u_1$, $u_2$ positions. $n = 200$, $k = 1000$, SNR = 5. We observe $u_2 < u_1$.

5.4.2. Determination of the delegates
The set of 'delegates'

$\mathcal D = \{\ell \in \{1, \dots, k\},\ \exists\,\ell' \in \{1, \dots, k\} \setminus \{\ell\}$ such that $|\Gamma_{\ell\ell'}| \ge \gamma/p^*\}$

is also identified with the 'task' $t = 1$. Each delegate is associated to one group. It remains
to distribute the variables whose indices are not in $\mathcal D$ among the $p^*$ different groups.
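
The first two BRG steps can be sketched numerically by evaluating $p(u)$ on a grid. The code below is a rough illustration under stated assumptions: the function name `brg_delegates`, the finite increasing grid `u_grid` standing in for the continuous variable $u$, and the fallback used when one of $u_1$, $u_2$ is not reached on the grid are choices made for this example only.

```python
import numpy as np


def brg_delegates(X, u_grid):
    """Approximate u1, u2, p* and the delegate set on an increasing grid of u."""
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)
    k = G.shape[0]
    gamma = G.max()                              # coherence, equation (4)
    later_max = np.triu(G, k=1).max(axis=1)      # max over l' > l of |Gamma_{l l'}|

    def p_of(u):
        # p(u): number of columns correlated above gamma / u with a later column
        return int(np.sum(later_max >= gamma / u))

    def first(cond):
        hits = [u for u in u_grid if cond(u)]
        return hits[0] if hits else None

    u1 = first(lambda u: p_of(u) >= k / u)                   # crossing of p(u) and g(u)
    u2 = first(lambda u: p_of(u) > 0 and p_of(u) * np.log(p_of(u)) >= k)
    found = [u for u in (u1, u2) if u is not None]
    if not found:
        raise ValueError("grid too short: neither u1 nor u2 was reached")
    p_star = int(np.floor(max(found)))
    delegates = [l for l in range(k) if G[l].max() >= gamma / p_star]
    return u1, u2, p_star, delegates


# Toy design: only columns 0 and 1 are correlated.
s = 2 ** -0.5
X_brg = np.array([[1, s, 0, 0], [0, s, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
u1_demo, u2_demo, p_star_demo, delegates_demo = brg_delegates(X_brg, range(1, 11))
```

On this toy design the two correlated columns are returned as delegates, while $u_2$ is not reached because $p(u)\log p(u)$ stays below $k$ on the grid.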




5.4.3. Completion of the groups

The variable of rank one in each group $\mathcal G_j$ is a variable belonging to $\mathcal D$. The allocation
is done in such a way that all the groups have the same (or almost the same) cardinality.
In the same way as for the Gathered Grouping procedures, we propose two versions of the
Boosting Grouping:

- (BGc): We rearrange the groups by sorting the correlation indicators associated to
the delegates: $R_{(1)} \ge \dots \ge R_{(p^*)}$. This means that $\mathcal G_1$ contains the delegate $\ell_1$ such that
$R_{\ell_1} = Y^t X_{\cdot\ell_1}$ takes the largest correlation value (equal to $R_{(1)}$) and $\mathcal G_{p^*}$ has the delegate
with the smallest correlation value $R_{\ell_{p^*}}$ (equal to $R_{(p^*)}$). The groups are then built such
that the $R$'s are as homogeneous as possible in each group and as close as possible to
their delegate. Grouping starts by ranking the remaining $R$'s (i.e. not associated to a
delegate): $R_{(1)} \ge \dots \ge R_{(k-p^*)}$. We denote $(\ell)$ the index associated to the quantity
$R_{(\ell)}$. The $p^*$ different groups are then successively filled by using the ranked indices:

$\mathcal G_1 = \{\ell_1, (1), \dots, (\lfloor k/p^*\rfloor - 1)\},\ \dots,\ \mathcal G_{p^*} = \{\ell_{p^*}, (k - p^* - \lfloor k/p^*\rfloor + 1), \dots, (k - p^*)\}.$

- (BGa): It is the same procedure as (BGc) but using the $|R|$'s instead of the $R$'s.
Notice that in this case, we have rearranged the groups by sorting the absolute values
of the correlation indicators associated to the delegates.

Again, to understand the improvement provided by BGa and BGc in the next section,
we also consider

- (BGr): the groups are completed randomly. The $k - p^*$ variables are spread
out randomly into the $p^*$ groups.
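
The completion step can be sketched as follows. This is only an illustration: the function name `complete_groups`, the `mode` labels and the use of `np.array_split` to obtain nearly balanced groups are assumptions of this example, and the exact block boundaries may differ from the formula above.

```python
import numpy as np


def complete_groups(R, delegates, mode="signed", seed=0):
    """Attach the non-delegate indices to the delegates in nearly balanced groups.

    mode = "signed" : rank by decreasing R_l    (BGc-like)
    mode = "abs"    : rank by decreasing |R_l|  (BGa-like)
    mode = "random" : random completion         (BGr-like)
    """
    R = np.asarray(R, dtype=float)
    score = np.abs(R) if mode == "abs" else R
    heads = sorted(delegates, key=lambda l: -score[l])   # delegates, best first
    in_heads = set(heads)
    rest = [l for l in range(len(R)) if l not in in_heads]
    if mode == "random":
        rest = [int(l) for l in np.random.default_rng(seed).permutation(rest)]
    else:
        rest.sort(key=lambda l: -score[l])
    chunks = np.array_split(np.array(rest, dtype=int), len(heads))
    return [[int(h)] + [int(l) for l in c] for h, c in zip(heads, chunks)]


# Hypothetical correlations for k = 6 regressors, with delegates 0 and 2.
bgc_demo = complete_groups([5, 1, 4, 2, 3, 0], delegates=[2, 0], mode="signed")
```

Each group starts with its delegate, followed by the non-delegate indices whose correlations are closest in rank.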

5.5. Quality of BRG 

Let us now consider the estimator $\hat\beta^*$ of $\beta$ obtained using the procedure GR-LOL combined
with a pre-processing using the BRG algorithm to form the groups. Applying Theorem 1
under the conditions of the theorem, it is easy to show that as soon as $\gamma \ge c\,[\log p/n]^{1/2}$

$\mathbb E\|\hat\beta^* - \beta\|_2^2 \le C\,\gamma^{2-q} . \qquad (17)$

6. Simulation 

In this section, an extensive simulation study is conducted to explore the practical
qualities of the GR-LOL procedure as well as the Boosting Grouping (BRG) procedure. In the first
part, we briefly describe the experimental design and the empirical tuning of the parameters
of the procedure. The second part is devoted to the study of the Boosting Grouping
procedures, compared with the gathered procedures given in Section 5.2 and with the procedures GGr
and BGr where the groups are filled randomly. Finally, the GR-LOL procedure (with a BRG
pre-processing) is compared with two other procedures : LOL and Group Lasso. The comparison
with LOL (see Mougeot et al. (2012)) allows us to check the contribution of the grouping, and
the comparison with the Group Lasso (see Yuan and Lin (2006)) allows us to evaluate GR-LOL
with respect to this challenging procedure involving an important optimization step.

6.1. Experimental design 

6.1.1. Generation of the variables 

The design matrix X is a standard Gaussian n x k matrix. Each column vector $X_{\cdot\ell}$ is
centered and normalized. The target observations Y are given by $Y = X\beta + W$ where

- $\beta$ is a vector of size k whose coordinates are zero except S of them, which are $\beta_\ell = (-1)^{b_\ell}|z_\ell|$ for
$\ell = 1, \dots, S$, where the b's are i.i.d. Rademacher variables and the z's are i.i.d. $\mathcal{N}(5, 1)$
variables.

- W are i.i.d. $\mathcal{N}(0, \sigma^2)$ variables. The variance $\sigma^2$ of the noise is chosen such that the
SNR (signal over noise ratio) is close to 5, which corresponds to a medium noise level.

To introduce some dependency between the regressors, we randomly choose a set, denoted
$\mathcal{R}$, of $p_d = \lfloor \pi k \rfloor$ variables among the k initial variables. Let us denote by $M_\rho$ the
$p_d \times p_d$ correlation matrix such that $M_\rho(i, i) = 1$ and $M_\rho(i, j) = \rho$ if $i \ne j$. Let V be the
eigenvector matrix and D the diagonal eigenvalue matrix of $M_\rho$, satisfying the singular value
decomposition $M_\rho = V D V^t$. Simulating a random Gaussian matrix Z of size $n \times p_d$, we
compute $X_{\mathcal{R}} = Z D^{1/2} V^t$ ; the columns $X_{\cdot\ell}$ of the resulting matrix satisfy $\mathrm{cor}(X_{\cdot\ell}, X_{\cdot\ell'}) = \rho$
as soon as $\ell \ne \ell'$. In order to cover a broad range of experiments, different proportion values ($\pi$ =
5%, 10%, 20%) as well as correlation values ($\rho$ = 0.0, 0.6, 0.8) have been studied. This method has
the advantage of tuning accurately the number of correlated variables as well as the amount
of correlation between the variables.
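
A possible NumPy sketch of this experimental design is given below. Several points are assumptions of the sketch and are not stated in the text: the function names, the normalization of the columns to unit Euclidean norm, the random signs used for the non-zero coefficients, and the definition SNR = var(X beta) / sigma^2.

```python
import numpy as np

def simulate_design(n=200, k=1000, pi=0.2, rho=0.6, seed=0):
    """Gaussian design with a random block of floor(pi * k) equicorrelated columns."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, k))
    p_d = int(np.floor(pi * k))
    corr_idx = np.sort(rng.choice(k, size=p_d, replace=False))
    if p_d > 1:
        M = np.full((p_d, p_d), rho)
        np.fill_diagonal(M, 1.0)
        eigval, V = np.linalg.eigh(M)               # M = V diag(eigval) V^t
        Z = rng.standard_normal((n, p_d))
        X[:, corr_idx] = (Z * np.sqrt(eigval)) @ V.T  # Z D^{1/2} V^t
    X -= X.mean(axis=0)                  # centered columns
    X /= np.linalg.norm(X, axis=0)       # unit norm (assumed normalization)
    return X, corr_idx

def simulate_target(X, S=10, snr=5.0, seed=0):
    """Sparse coefficients with random signs and |N(5, 1)| magnitudes, plus noise."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    beta = np.zeros(k)
    signs = rng.choice([-1.0, 1.0], size=S)         # random signs (assumption)
    beta[:S] = signs * np.abs(rng.normal(5.0, 1.0, size=S))
    signal = X @ beta
    sigma = np.sqrt(signal.var() / snr)  # assumed definition of the SNR
    Y = signal + rng.normal(0.0, sigma, size=n)
    return Y, beta, sigma
```

Since the rows of Z have identity covariance, the rows of $Z D^{1/2} V^t$ have covariance $V D V^t = M_\rho$, which is why the selected columns end up with pairwise correlation close to $\rho$.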

6.1.2. Tuning parameters of the algorithms 

As usual for thresholding methods, the parameters $\lambda_n(1)$ and $\lambda_n(2)$ involved in the GR-LOL
procedure are critical values which are quite hard to tune because they depend on constants which
are not optimized and may not be available in practice. In this work, we tune them
empirically as follows :

Threshold $\lambda_n(1)$. The first threshold is used to select the leader groups. Remember that
at this stage, the number p of groups is known (or determined by BG). We do
not determine the level $\lambda_n(1)$ directly but find the number $p_0$ of leader groups, which
is equivalent. Rearrange the groups along the values of $\rho_j$ and denote by $\mathcal{G}_{(1)}, \dots, \mathcal{G}_{(p)}$
the result of the ranking. More precisely, the group $\mathcal{G}_{(j)}$ is associated to the quantity
$\rho_{(j)}$, where $\rho_{(j)}$ is the jth element of the list $\rho_{(1)} \ge \dots \ge \rho_{(p)}$. We also denote by $t_{(j)}$ the
cardinality of such a group, and $p_0$ is simply determined by

$$\sum_{j=1}^{p_0} t_{(j)} \le n \quad \text{and} \quad \sum_{j=1}^{p_0+1} t_{(j)} > n.$$

When using grouping procedures, original variables are not handled directly but through 
groups. If an important variable (i.e. associated with a large coefficient of correlation 
with the target) belongs to a cluster among unimportant variables (associated with 
small coefficients), this variable may easily be unseen and killed during the first thre- 
sholding step. This procedure slightly differs from the LOL original procedure in being 
much less restrictive during the first thresholding step and allowing to finally keep 
more variables through the groups. 
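
The rule defining $p_0$ amounts to a cumulative sum over the ranked group sizes. A minimal sketch follows; the function name and the argument `scores` (standing for the group indicators $\rho_j$, whose definition is given earlier in the paper) are hypothetical names used for illustration.

```python
import numpy as np

def count_leader_groups(scores, sizes, n):
    """Return p0 and the indices of the leader groups.

    Groups are ranked by decreasing score; p0 is the largest number of
    top-ranked groups whose cumulated size does not exceed n.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    cum = np.cumsum(np.asarray(sizes)[order])
    p0 = int(np.searchsorted(cum, n, side="right"))  # number of cum values <= n
    return p0, order[:p0]
```

For instance, with scores (0.2, 0.9, 0.5, 0.7), sizes (3, 2, 4, 1) and n = 6, the ranked sizes are 2, 1, 4, 3, the cumulated sizes are 2, 3, 7, 10, and two leader groups are kept.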

Threshold $\lambda_n(2)$. For the second thresholding step, we do not, as previously, determine
the level $\lambda_n(2)$ directly but find the number $p_1$ of finally retained groups,
which is equivalent. The second threshold $\lambda_n(2)$, used for denoising, is computed by
5-fold cross-validation. A proportion of 80% of the observations is used to estimate
the $\beta$ coefficients.

The $p_0$ groups kept after the first thresholding are ranked using the $\ell_2$-norm of their
estimated coefficients, $\|\hat\beta\|_{\mathcal{G}_j,2}$. The group $\mathcal{G}_{(j)}$, associated to the quantity $\|\hat\beta\|_{\mathcal{G}_{(j)},2}$,
corresponds to the jth element of the list $\|\hat\beta\|_{\mathcal{G}_{(1)},2} \ge \dots \ge \|\hat\beta\|_{\mathcal{G}_{(p_0)},2}$. The 20%
remaining observations are used to sequentially compute the prediction error using the
first one, ..., the first j groups of the previous ranking list. Using a model involving the
first j groups, the prediction error is defined by $\|Y - \hat{Y}_{U_j}\|_2^2$ where $U_j = \mathcal{G}_{(1)} \cup \dots \cup \mathcal{G}_{(j)}$. The
prediction error is averaged over the 5 folds of the cross-validation. Finally, the first groups
corresponding to the minimum prediction error are kept.
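
The cross-validated choice of $p_1$ could look like the following sketch. It rests on assumptions that the text leaves open: the coefficients are estimated by least squares on the leader variables, the model with the first j groups keeps these estimates and sets the others to zero (no refit), and the errors are averaged per value of j over the folds. The function name `select_p1` is hypothetical.

```python
import numpy as np

def select_p1(X, Y, groups, n_folds=5, seed=0):
    """Choose the number p1 of retained groups by K-fold cross-validation.

    groups : list of index lists, the p0 leader groups kept by the first step
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    cols = np.concatenate([np.asarray(g) for g in groups])
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = np.zeros((n_folds, len(groups)))
    for f, test in enumerate(folds):
        train = np.setdiff1d(np.arange(n), test)
        coef = np.zeros(k)
        # least squares on the leader variables, using the training part only
        coef[cols] = np.linalg.lstsq(X[np.ix_(train, cols)], Y[train], rcond=None)[0]
        norms = np.array([np.linalg.norm(coef[g]) for g in groups])
        order = np.argsort(-norms, kind="stable")    # decreasing l2-norms
        kept = np.zeros(k)
        for j, g_idx in enumerate(order):
            g = groups[g_idx]
            kept[g] = coef[g]                        # add the (j+1)-th ranked group
            errors[f, j] = np.sum((Y[test] - X[test] @ kept) ** 2)
    # number of groups minimizing the prediction error averaged over the folds
    return int(np.argmin(errors.mean(axis=0))) + 1
```

In this sketch the ranking is recomputed inside each fold; the final set of retained groups would then be the $p_1$ top-ranked groups of a fit on the full sample.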

In Section 6.2 we use the LOL and Group Lasso algorithms, which both have tuning parameters
as well. The LOL algorithm is a particular case of GR-LOL when the number of groups equals
the number of variables, i.e. p = k. For a fair comparison, we use here for LOL the same
algorithm as for GR-LOL in the case where p = k. (And so we have here a slight difference
with the LOLA procedure provided in Mougeot et al. (2012).) For the Group Lasso, the number of
final groups is computed by cross-validation as described in (Yuan and Lin (2006), Ma et al.
(2007), Huang et al. (2010)). As usual, the initial sample of observations is split into two
samples : the training set contains 75% of the n observations and is used when the algorithms
are running, the test set contains 25% of the n observations and is used for the
cross-validation methods.

6.1.3. Criterion to evaluate the quality of the method 

For each studied procedure P (P is either BG(a,c,r) or GG(a,c,r)) with prediction $\hat{Y}^P$,
the relative prediction error $E_Y^P = \|Y - \hat{Y}^P\|_2^2 / \|Y\|_2^2$ is computed on the target Y. The results
presented in the tables give median values and standard deviations when K = 100 replications
of the algorithms are performed. When GR-LOL is compared with another procedure P (P
is either LOL or Group Lasso), the ratio $E_Y^P / E_Y^{GR\text{-}LOL}$ is computed. If the ratio is close to 1,
the methods perform similarly ; when the ratio is larger than 1, GR-LOL outperforms P.
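
As a small illustration of this criterion (the function names are ours, not the authors'), the relative error and the comparison ratio can be computed as:

```python
import numpy as np

def relative_error(Y, Y_hat):
    """Relative prediction error ||Y - Y_hat||_2^2 / ||Y||_2^2."""
    Y, Y_hat = np.asarray(Y, dtype=float), np.asarray(Y_hat, dtype=float)
    return np.sum((Y - Y_hat) ** 2) / np.sum(Y ** 2)

def error_ratio(Y, Y_hat_other, Y_hat_grlol):
    """Ratio E^P / E^GR-LOL: a value above 1 means that GR-LOL has the smaller error."""
    return relative_error(Y, Y_hat_other) / relative_error(Y, Y_hat_grlol)
```

For Y = (3, 4), a prediction (3, 3) has relative error 1/25 and a prediction (1, 4) has relative error 4/25, so the ratio of the second to the first equals 4.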

6.1.4. BRG : Number p of groups

Recall that the first step of BRG consists in determining the number p* of groups, and
is detailed in Section 5.4.1. Figure 2 shows the average size of the groups computed with
the BRG procedure when the level of dependence between the regressors, given by $\pi$ and
$\rho$, varies continuously. When no dependency is introduced in the design matrix, we
observe that the groups contain on average t* = 1.5 variables using the experimental design
previously described. Observe that the size of the groups increases (and then the number
p* of groups decreases) with the level of dependency between the regressors (with $\pi$ or
$\rho$). For example, for $\rho$ = 0.8, the size of the group is almost multiplied by 2 as $\pi$ decreases
from 50% to 10%.




Figure 2: Y-axis : size of the groups. X-axis : correlation $\rho$ between the regressors for $\pi$ =
10%, 20%, 30%, 40%, 50%. n = 200, k = 1000, SNR = 5, K = 100.



For a fair comparison, the number p* of groups is the same for all the methods ; only
the distribution of the variables among the different groups varies. Defining the number of
groups is not an easy task. It should be underlined that in this case, the Random Grouping
and the Gathered Grouping both benefit from the optimal and automatic choice of p* proposed
by the boosting strategy. It should also be noticed that the Gathered and Boosting Grouping
algorithms provide very different configurations for the groups, as the average size t* of the
groups is small.

6.1.5. Impact on the Coherence 

The empirical coherences $\tau_{BG}$, $\tau_{BT}$ and $\tau$ are computed and shown in Table 1 for different
values of correlation ($\pi$ = 0%, 20%, 40% ; $\rho$ = 0.0, 0.6, 0.8) and for all the considered grouping
strategies. For each simulation, we have $\tau = \sup(\tau_{BT}, \tau_{BG})$. As the results presented in Table
1 are averaged over K = 100 replications, this property does not necessarily hold for the averaged values,
especially for the Gathering groupings (GGa, GGc, GGr), which can provide very different groups
each time.

As expected, the boosting strategies induce a strong decrease of $\tau_{BT}$ as soon as there
exists some dependency ($\pi \ge 20\%$, $\rho \ge 0.6$). However, the different strategies for filling the groups
(BGr, BGc, BGa) do not have any influence on $\tau_{BG}$, as also expected. The gathered
groupings (GGc, GGa) do not help to reduce $\tau_{BT}$ and $\tau^*$. As expected, the empirical coherence
(denoted $\tau$ in the theoretical part) increases with the dependence level $\rho$. Table 1 also shows
the empirical values of $\tau^* = t^*\tau_{BT} + \tau_{BG}$ and $r^* = t^*\tau_{BT}^2 + \tau_{BG}^2$ computed for the different
strategies.
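
The following sketch computes these indices for a given grouping. The precise definitions of $\tau_{BT}$ and $\tau_{BG}$ are given earlier in the paper; here we assume, following the way they are used in the proof of the RIP lemma in the appendix, that $\tau_{BT}$ bounds the Gram entries between variables of different ranks t and $\tau_{BG}$ those between variables of the same rank in different groups. The function names are hypothetical, and the columns of X are assumed normalized.

```python
import numpy as np

def coherence_indices(X, groups):
    """Empirical tau_BT and tau_BG for ordered groups (position in group = rank t)."""
    gram = np.abs(X.T @ X)
    k = X.shape[1]
    rank = np.empty(k, dtype=int)
    group = np.empty(k, dtype=int)
    for j, g in enumerate(groups):
        for t, idx in enumerate(g):
            rank[idx], group[idx] = t, j
    same_rank = rank[:, None] == rank[None, :]
    other_group = group[:, None] != group[None, :]
    mask_bt = ~same_rank                  # pairs with different ranks
    mask_bg = same_rank & other_group     # same rank, different groups
    tau_bt = float(gram[mask_bt].max()) if mask_bt.any() else 0.0
    tau_bg = float(gram[mask_bg].max()) if mask_bg.any() else 0.0
    return tau_bt, tau_bg

def tau_star(t_star, tau_bt, tau_bg):
    return t_star * tau_bt + tau_bg

def r_star(t_star, tau_bt, tau_bg):
    return t_star * tau_bt ** 2 + tau_bg ** 2
```

As a check against Table 1, the GGr row of the first panel (t* = 1.40, $\tau_{BT}$ = 0.321, $\tau_{BG}$ = 0.317) gives $\tau^*$ close to 0.766 and $r^*$ close to 0.245.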

6.1.6. Benefits of boosting grouping 

Table 2 compares the random (GGr), Gathered (GGc, GGa) and boosting groupings (BGr,
BGc, BGa) for different sparsities S and different levels of dependence ($\rho$, $\pi$). Let us first
comment on the no-dependency case ($\pi$ = 0). When the sparsity is high (S = 10, 20, 30), similar
performances are obtained for any grouping strategy. Note that even building the groups
in a completely random manner is not a bad strategy. When the sparsity is low (S = 40, 50),
the Gathered Groupings (GGa and GGc) bring the best results with a weak variability (low
standard deviation). As there is no specific correlation between the regressors, the boosting
procedure brings, as expected, no added value in this case.

Actually, the boosting grouping procedure is especially suited to large correlations.
For instance, when $\rho$ and $\pi$ are significant ($\rho$ = 0.6 and $\pi$ = 0.4), the
boosting procedure clearly shows substantial benefits. However, the performance of the
boosting depends on the strategy for filling the groups. When the number of correlated variables
is small ($\pi$ = 0.2), the boosting associated with groups filled randomly (BGr) is rather
competitive compared to the Gathered groupings (GGc, GGa). However, the boosting procedures
with groups filled homogeneously (BGc, BGa) always show the best performances, with a
preference for the absolute value criterion. When there are strong correlations between the
regressors ($\rho$ = 0.6, 0.8), the boosting procedures (BGc, BGa) clearly outperform the random
and the Gathered groupings, and this is even true when the groups are filled randomly (BGr).
BGa always brings the best results when the sparsity S increases and/or the correlation $\rho$
between the regressors increases.

$\pi$ = 0%, $\rho$ = 0.0
       $t^*$          $\tau_{BT}$        $\tau_{BG}$        $\tau$             $\tau^*$   $r^*$
GGr    1.40 (0.06)    0.321 (0.030)    0.317 (0.014)    0.327 (0.015)    0.766    0.245
GGc    1.40 (0.06)    0.318 (0.029)    0.321 (0.016)    0.327 (0.015)    0.766    0.245
GGa    1.40 (0.06)    0.318 (0.029)    0.319 (0.017)    0.327 (0.015)    0.764    0.243
BGr    1.40 (0.06)    0.234 (0.016)    0.327 (0.015)    0.327 (0.015)    0.655    0.184
BGc    1.40 (0.06)    0.234 (0.015)    0.327 (0.015)    0.327 (0.015)    0.655    0.184
BGa    1.40 (0.06)    0.234 (0.015)    0.327 (0.015)    0.327 (0.015)    0.655    0.184

$\pi$ = 20%, $\rho$ = 0.6
       $t^*$          $\tau_{BT}$        $\tau_{BG}$        $\tau$             $\tau^*$   $r^*$
GGr    2.80 (0.07)    0.731 (0.029)    0.723 (0.020)    0.733 (0.018)    2.770    2.019
GGc    2.80 (0.07)    0.730 (0.032)    0.726 (0.019)    0.733 (0.018)    2.770    2.019
GGa    2.80 (0.07)    0.730 (0.032)    0.724 (0.019)    0.733 (0.018)    2.768    2.016
BGr    2.80 (0.07)    0.260 (0.018)    0.733 (0.018)    0.733 (0.018)    1.460    0.726
BGc    2.80 (0.07)    0.260 (0.016)    0.733 (0.018)    0.733 (0.018)    1.460    0.726
BGa    2.80 (0.07)    0.260 (0.016)    0.733 (0.018)    0.733 (0.018)    1.460    0.726

$\pi$ = 40%, $\rho$ = 0.6
       $t^*$          $\tau_{BT}$        $\tau_{BG}$        $\tau$             $\tau^*$   $r^*$
GGr    2.40 (0.02)    0.742 (0.030)    0.739 (0.019)    0.746 (0.019)    2.521    1.869
GGc    2.40 (0.02)    0.742 (0.031)    0.740 (0.019)    0.746 (0.019)    2.522    1.870
GGa    2.40 (0.02)    0.741 (0.031)    0.740 (0.019)    0.746 (0.019)    2.519    1.866
BGr    2.40 (0.02)    0.303 (0.024)    0.746 (0.019)    0.746 (0.019)    1.474    0.778
BGc    2.40 (0.02)    0.303 (0.024)    0.746 (0.019)    0.746 (0.019)    1.474    0.778
BGa    2.40 (0.02)    0.303 (0.021)    0.746 (0.019)    0.746 (0.019)    1.473    0.777

$\pi$ = 40%, $\rho$ = 0.8
       $t^*$          $\tau_{BT}$        $\tau_{BG}$        $\tau$             $\tau^*$   $r^*$
GGr    2.50 (0.03)    0.868 (0.016)    0.866 (0.011)    0.869 (0.011)    3.035    2.631
GGc    2.50 (0.03)    0.867 (0.017)    0.867 (0.011)    0.869 (0.011)    3.035    2.632
GGa    2.50 (0.03)    0.867 (0.017)    0.867 (0.011)    0.869 (0.011)    3.035    2.632
BGr    2.50 (0.03)    0.315 (0.025)    0.869 (0.010)    0.869 (0.010)    1.656    1.003
BGc    2.50 (0.03)    0.314 (0.028)    0.869 (0.010)    0.869 (0.010)    1.654    1.002
BGa    2.50 (0.03)    0.317 (0.025)    0.869 (0.010)    0.869 (0.010)    1.662    1.007

Table 1: Empirical coherences $\tau_{BG}$, $\tau_{BT}$, $\tau$ (standard deviations in parentheses) computed when the groups are built using the different
strategies. SNR = 5, n = 200, k = 1000, K = 100, $\pi$ = 40%.

$\pi$ = 0%, $\rho$ = 0.0     S = 10          S = 20          S = 30          S = 40          S = 50
GGr                    3.06 (0.02)     6.32 (0.04)     11.36 (0.07)    14.11 (0.09)    14.53 (0.10)
GGc                    3.18 (0.02)     5.18 (0.03)     7.86 (0.04)     9.81 (0.08)     10.36 (0.06)
GGa                    2.97 (0.02)     4.93 (0.02)     7.62 (0.04)     9.19 (0.07)     10.31 (0.06)
BGr                    3.09 (0.02)     7.09 (0.04)     10.62 (0.06)    11.64 (0.07)    14.21 (0.09)
BGc                    3.15 (0.02)     5.71 (0.05)     9.84 (0.06)     12.30 (0.07)    12.31 (0.07)
BGa                    3.04 (0.02)     5.33 (0.03)     8.32 (0.05)     10.12 (0.06)    10.97 (0.07)

$\pi$ = 20%, $\rho$ = 0.6    S = 10          S = 20          S = 30          S = 40          S = 50
GGr                    8.85 (0.16)     28.82 (0.22)    36.43 (0.23)    40.61 (0.24)    44.56 (0.23)
GGc                    7.52 (0.18)     21.61 (0.24)    41.72 (0.26)    38.04 (0.26)    37.50 (0.24)
GGa                    7.93 (0.17)     24.43 (0.24)    33.26 (0.26)    40.67 (0.27)    39.98 (0.24)
BGr                    9.35 (0.13)     20.28 (0.17)    25.35 (0.19)    35.28 (0.20)    32.80 (0.17)
BGc                    7.22 (0.10)     14.41 (0.15)    24.76 (0.16)    25.50 (0.18)    27.43 (0.19)
BGa                    6.04 (0.05)     10.02 (0.06)    14.14 (0.10)    20.96 (0.13)    20.35 (0.14)

$\pi$ = 40%, $\rho$ = 0.6    S = 10          S = 20          S = 30          S = 40          S = 50
GGr                    19.78 (0.19)    31.66 (0.20)    39.23 (0.21)    45.73 (0.22)    44.66 (0.21)
GGc                    17.74 (0.18)    38.77 (0.22)    38.96 (0.23)    51.72 (0.22)    51.14 (0.23)
GGa                    18.28 (0.19)    40.82 (0.22)    43.42 (0.21)    59.12 (0.22)    54.80 (0.23)
BGr                    10.51 (0.07)    17.34 (0.11)    24.71 (0.13)    30.16 (0.16)    31.15 (0.19)
BGc                    9.48 (0.09)     19.03 (0.14)    24.26 (0.16)    30.14 (0.17)    31.27 (0.18)
BGa                    7.51 (0.06)     10.43 (0.07)    16.30 (0.09)    20.63 (0.12)    23.41 (0.13)

$\pi$ = 40%, $\rho$ = 0.8    S = 10          S = 20          S = 30          S = 40          S = 50
GGr                    29.75 (0.20)    43.95 (0.23)    39.27 (0.23)    48.75 (0.22)    48.80 (0.27)
GGc                    37.59 (0.22)    49.77 (0.25)    49.81 (0.26)    57.23 (0.24)    53.57 (0.26)
GGa                    36.69 (0.21)    51.53 (0.25)    50.64 (0.26)    59.99 (0.25)    60.64 (0.26)
BGr                    7.85 (0.05)     13.95 (0.08)    18.14 (0.13)    19.82 (0.17)    26.48 (0.19)
BGc                    8.33 (0.07)     14.93 (0.11)    20.52 (0.15)    21.41 (0.17)    28.95 (0.18)
BGa                    5.96 (0.05)     9.19 (0.05)     12.72 (0.10)    16.26 (0.13)    19.44 (0.17)

Table 2: Relative prediction errors $E_Y$ (x 100) for Boosting Grouping (BGr, BGc, BGa), Gathered Grouping
(GGc, GGa) and Random Grouping (GGr) when the sparsity is varying, for various levels of dependency
given by $\pi$, $\rho$. SNR = 5, n = 200, k = 1000, K = 100.



6.2. Study of the GR-LOL procedure 

In this part, we present the performance results when the GR-LOL procedure, associated with
the Boosting Grouping strategy (BGa), is applied to the experimental design presented above.
Comparisons between GR-LOL and LOL on the one hand, and between GR-LOL and Group Lasso
on the other hand, are explored.

6.2.1. GR-LOL versus LOL 

The main difference between LOL and GR-LOL is that GR-LOL manipulates groups of
variables while the LOL procedure handles the variables directly. Table 3 shows a comparison of
the performances obtained for LOL for the same experimental design as above.

$\pi$      $\rho$     S = 10    S = 20    S = 30    S = 40    S = 50
0%     0.00    1.083     1.522     1.616     1.777     1.666
20%    0.60    1.342     2.854     3.572     2.636     2.414
20%    0.80    1.877     5.436     3.898     3.117     2.649
40%    0.60    3.607     4.341     3.715     2.856     2.410
40%    0.80    6.287     6.429     4.773     3.440     3.417

Table 3: Relative prediction errors ratio $E_Y^{LOL}/E_Y^{GR\text{-}LOL}$ for LOL and GR-LOL when the sparsity is varying,
for different correlation values $\rho$ = 0.0, 0.6, 0.8 and rates $\pi$ = 0, 0.2, 0.4. SNR = 5, n = 200, k = 1000.



We observe that the LOL procedure performs particularly well when the sparsity is large (S
small) and when the dependence between the regressors is weak (Mougeot et al. (2012)). In
this case, GR-LOL brings no improvement compared to LOL. Observe that, if there is no
dependency (case where $\rho$ = 0.0), the grouping improves the performances of LOL when
the sparsity decreases (S increases). If the dependency increases (case where $\rho$ = 0.6, 0.8),
GR-LOL always outperforms LOL for any considered sparsity.

6.2.2. GR-LOL versus Group-lasso 

The Group Lasso is one of the most popular procedures for penalized regression with
grouped variables, so we choose this method to challenge the Boosting Grouping procedure.
To be fair, for both procedures, the groups are built using the boosting strategy (BGa) and
cross-validation is used in both cases to determine the final model.

The comparison of the prediction results is given in Table 4. Both procedures show similar
behaviors in two cases : when there is no high correlation between the covariables ($\pi$ = 0)
or when the sparsity is small (S = 50). In the other cases (especially when the sparsity is
large, i.e. S small), the results given by GR-LOL are excellent : GR-LOL always outperforms
the Group Lasso.

To end this comparison, let us give a few words about computational aspects. The Group 
Lasso algorithm is based on an optimization procedure which can be time consuming while 
GR-LOL procedure solves the penalized regression using two thresholding steps and a clas- 
sical regression. Regarding the complexity of the different methods, GR-LOL has a strong 
advantage over the Group Lasso. 
$\pi$      $\rho$     S = 10    S = 20    S = 30    S = 40    S = 50
0%     0.0     1.228     1.318     1.143     1.827     2.001
5%     0.6     4.584     2.366     1.470     1.944     1.706
5%     0.8     5.179     2.490     1.937     1.122     0.829
10%    0.6     2.764     3.124     1.892     1.825     0.967
10%    0.8     4.744     1.643     1.824     1.511     0.739
20%    0.6     2.176     3.032     1.764     1.385     1.426
20%    0.8     3.250     3.015     1.986     1.098     1.048

Table 4: Relative prediction errors ratio $E_Y^{GLasso}/E_Y^{GR\text{-}LOL}$ for GR-LOL and GLasso when the sparsity is
varying, for various levels of dependency given by $\pi$, $\rho$. SNR = 5, n = 200, k = 1000.

6.3. Conclusion 

This experimental study shows that true benefits can be obtained using a grouping
approach for penalized regression, even in the case where there is no prior knowledge on the
groups. However, the results rely heavily on the grouping strategy. The boosting
strategy brings a nice answer to the grouping problem when no prior information is available on
the structured sparsity. This strategy is very easy to implement and especially well adapted
when a strong correlation exists between the regressors in the case of high sparsity (S small).

7. Proofs 

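7.1. RIP and associated properties : $\tau^*$-conditions

In this part, we collect properties which are linked with the coherence $\tau^*$. All these
inequalities are extensively used in the proof of Theorem 1 and in the proofs of the propositions
stated in Section 7.2 ; their proofs are detailed in the appendix.

Recall that for $\mathcal{I} \subset \{1, \dots, k\}$, $\Gamma_{\mathcal{I}} = X_{\mathcal{I}}^t X_{\mathcal{I}}$ is the Gram matrix associated with $X_{\mathcal{I}}$, the
matrix restricted to the columns of X whose indices are in $\mathcal{I}$. Denote by $P_{V_{\mathcal{I}}}$ the projection
on the space $V_{\mathcal{I}}$ spanned by the predictors $X_{\cdot\ell}$ whose indices $\ell$ belong to $\mathcal{I}$. We also denote by
$\bar\alpha(\mathcal{I})$ the vector of $\mathbb{R}^{\#(\mathcal{I})}$ such that

$$X_{\mathcal{I}}\,\bar\alpha(\mathcal{I}) = P_{V_{\mathcal{I}}}[X\alpha]. \qquad (18)$$

As well, we define $\hat\alpha(\mathcal{I})$ as the vector of $\mathbb{R}^{\#(\mathcal{I})}$ such that

$$X_{\mathcal{I}}\,\hat\alpha(\mathcal{I}) = P_{V_{\mathcal{I}}}[Y]. \qquad (19)$$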

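The following lemma describes the 'block-diagonal' aspect of the Gram matrices $\Gamma_{\mathcal{I}}$, at
least when the set of indices $\mathcal{I}$ is small enough. It corresponds to the 'group-version' of
the link between coherence and the RIP property (see for instance the corresponding result in
Mougeot et al. (2012)).

Lemma 1. (RIP-property) Let $0 < \nu < 1$ be fixed. Let $\mathcal{I}$ be a subset of $\{1, \dots, k\}$ such that
$\tau(\mathcal{I}) \le \nu$. Then we get

$$\forall x \in \mathbb{R}^{\#(\mathcal{I})}, \quad \|x\|_2^2 (1 - \nu) \le x^t \Gamma_{\mathcal{I}}\, x \le \|x\|_2^2 (1 + \nu). \qquad (20)$$

We deduce that the Gram matrix $\Gamma_{\mathcal{I}}$ is almost diagonal and in particular invertible as
soon as $\tau(\mathcal{I}) \le \nu$. When this upper bound on $\tau(\mathcal{I})$ holds, we also extensively use the RIP
Property (20) in the following forms :

$$\forall x \in \mathbb{R}^{\#(\mathcal{I})}, \quad \|x\|_2^2 (1 + \nu)^{-1} \le x^t \Gamma_{\mathcal{I}}^{-1} x \le (1 - \nu)^{-1} \|x\|_2^2, \qquad (21)$$

and

$$\forall x \in \mathbb{R}^{\#(\mathcal{I})}, \quad (1 - \nu)\|x\|_2^2 \le \Big\|\sum_{\ell \in \mathcal{I}} x_\ell X_{\cdot\ell}\Big\|_2^2 \le (1 + \nu)\|x\|_2^2. \qquad (22)$$

We also need the following lemma.

Lemma 2. For any subset $\mathcal{I}$ of $\{1, \dots, k\}$ such that $\tau(\mathcal{I}) \le \nu$, we have

$$\forall x \in \mathbb{R}^n, \quad (1 + \nu)^{-1} \sum_{\ell \in \mathcal{I}} \Big(\sum_{i=1}^n x_i X_{i\ell}\Big)^2 \le \|P_{V_{\mathcal{I}}} x\|_2^2 \le (1 - \nu)^{-1} \sum_{\ell \in \mathcal{I}} \Big(\sum_{i=1}^n x_i X_{i\ell}\Big)^2. \qquad (23)$$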
7.2. Behavior of the projectors : $r^*$-conditions

In this subsection, we describe properties of the projection which are more general than those of
the previous part, where the results were linked to the RIP property. These properties depend
on the index $r^* = t^*\tau_{BT}^2 + \tau_{BG}^2$. It is noteworthy that in the no-group setting, we
do not need to introduce this indicator $r^*$ since in this case $r^* = (\tau^*)^2$. Hence this is
precisely one of the places where the grouping requires a different argument.

Let us now state the following technical results, which are essential in the sequel.

Lemma 3. Let $\mathcal{I}$, $\mathcal{C}$ be subsets of $\{1, \dots, k\}$ and put

$$B(\mathcal{C})_\ell = \sum_{\ell' \in \mathcal{C},\ \ell' \ne \ell} \Gamma_{\ell\ell'}\,\alpha_{\ell'}$$

for any $\ell$. Then, we have

$$\|B(\mathcal{C})\|_{\mathcal{I},2}^2 \le 2\,\Big(\sum_{\ell \in \mathcal{C}} |\alpha_\ell|\Big)^2 r(\mathcal{I})$$

where $r(\mathcal{I})$ is defined in (11).

Proposition 1. For any integer j from $\{1, \dots, p\}$, we get

$$\big|\, \|R\|_{\mathcal{G}_j,2} - \|\alpha\|_{\mathcal{G}_j,2} \,\big|^2 \le 4M^2\nu_n\, r(\mathcal{G}_j) + 2(1 + \nu)\|P_{V_{\mathcal{G}_j}} W\|_2^2.$$

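Proposition 2. For any subset $\mathcal{I}$ of the leaders indices set $\mathcal{G}_{\mathcal{B}}$, there exists $\kappa$ depending on
$\nu$ such that

$$\|\hat\alpha - \alpha\|_{\mathcal{I},2}^2\, I\{\mathcal{I} \subset \mathcal{G}_{\mathcal{B}}\} \le \kappa \left(\|\alpha\|_1^2\, r(\mathcal{I}) + \|P_{V_{\mathcal{I}}} W\|_2^2 + \|P_{V_{\mathcal{G}_{\mathcal{B}}}} W\|_2^2\, r(\mathcal{I})\right).$$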
More precisely 

1 6 w 4(2v 2 -v + 2) 

K > V — — V 



1-v (1-v) 3 (1-v) 4 
Proposition 3. Let $\mathcal{I}$ be a non random subset such that $\#(\mathcal{I}) \le n_{\mathcal{I}}$, where $n_{\mathcal{I}}$ is a
deterministic quantity, then

$$P\left(\|P_{V_{\mathcal{I}}}[W]\|_2^2 \ge z^2\right) \le \exp(-z^2/16) \qquad (24)$$

for any z such that $z^2 \ge 4 n_{\mathcal{I}}$. If now $\mathcal{I}$ is a random subset of the form $\{(j,t),\ j \in A,\ 1 \le
t \le t_j\}$ where A is a random subset of $\{1, \dots, p\}$ of cardinality less than L (a deterministic constant),
Inequality (24) is still true but for any z such that $z^2 \ge 16 L (t^* \vee \log p)$. In particular, this
implies that for such a set, for any $k \ge 1$, there exists a constant $C_k$ such that

$$E\left(\|P_{V_{\mathcal{I}}}[W]\|_2^{2k}\right) \le C_k L^k (t^* \vee \log p)^k. \qquad (25)$$

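7.3. Proof of the Theorem

Thanks to Condition (3), we have

$$a\nu_n \|\hat\beta^* - \beta\|_2^2 \le \|\hat\alpha^* - \alpha\|_2^2 \le b\nu_n \|\hat\beta^* - \beta\|_2^2$$

which allows us to focus on the estimation error $\|\hat\alpha^* - \alpha\|_2^2$. We have

$$\|\hat\alpha^* - \alpha\|_2^2 = \|\hat\alpha^* - \alpha\|_{\mathcal{G}_{\mathcal{B}},2}^2 + \|\alpha\|_{\mathcal{G}_{\mathcal{B}^c},2}^2 := I \text{ (In)} + O \text{ (Out)}.$$

We split I into four terms :

$$I = \sum_{j \in \mathcal{B}} I\{\|\hat\alpha\|_{\mathcal{G}_j,2} \ge \lambda_n(2)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} \ge \lambda_n(2)/2\}\, \|\hat\alpha - \alpha\|_{\mathcal{G}_j,2}^2$$
$$+ \sum_{j \in \mathcal{B}} I\{\|\hat\alpha\|_{\mathcal{G}_j,2} \ge \lambda_n(2)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} < \lambda_n(2)/2\}\, \|\hat\alpha - \alpha\|_{\mathcal{G}_j,2}^2$$
$$+ \sum_{j \in \mathcal{B}} I\{\|\hat\alpha\|_{\mathcal{G}_j,2} < \lambda_n(2)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} \ge 2\lambda_n(2)\}\, \|\alpha\|_{\mathcal{G}_j,2}^2$$
$$+ \sum_{j \in \mathcal{B}} I\{\|\hat\alpha\|_{\mathcal{G}_j,2} < \lambda_n(2)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} < 2\lambda_n(2)\}\, \|\alpha\|_{\mathcal{G}_j,2}^2$$

:= IBB (InBigBig) + IBS (InBigSmall) + ISB (InSmallBig) + ISS (InSmallSmall).

We have on the other hand,

$$O \le \sum_{j \in \mathcal{B}^c} I\{\|R\|_{\mathcal{G}_j,2} < \lambda_n(1)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} \ge 2\lambda_n(1)\}\, \|\alpha\|_{\mathcal{G}_j,2}^2$$
$$+ \sum_{j \in \mathcal{B}^c} I\{\|R\|_{\mathcal{G}_j,2} < \lambda_n(1)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} < 2\lambda_n(1)\}\, \|\alpha\|_{\mathcal{G}_j,2}^2$$
$$+ \sum_{j \in \mathcal{B}^c} I\{\|R\|_{\mathcal{G}_j,2} \ge \lambda_n(1)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} \ge \lambda_n(1)/2\}\, \|\alpha\|_{\mathcal{G}_j,2}^2$$
$$+ \sum_{j \in \mathcal{B}^c} I\{\|R\|_{\mathcal{G}_j,2} \ge \lambda_n(1)\}\, I\{\|\alpha\|_{\mathcal{G}_j,2} < \lambda_n(1)/2\}\, \|\alpha\|_{\mathcal{G}_j,2}^2$$

:= OSB (OutSmallBig) + OSS (OutSmallSmall) + OBB (OutBigBig) + OBS (OutBigSmall).
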
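7.3.1. Study of IBB and ISB

Let us first study ISB. Observe that the two conditions $\|\hat\alpha\|_{\mathcal{G}_j,2} < \lambda_n(2)$ and $\|\alpha\|_{\mathcal{G}_j,2} \ge
2\lambda_n(2)$ imply $\|\hat\alpha\|_{\mathcal{G}_j,2} \le \|\alpha\|_{\mathcal{G}_j,2}/2$. We deduce that

$$\|\hat\alpha - \alpha\|_{\mathcal{G}_j,2} \ge \|\alpha\|_{\mathcal{G}_j,2} - \|\hat\alpha\|_{\mathcal{G}_j,2} \ge \|\alpha\|_{\mathcal{G}_j,2}/2$$

and then

$$ISB \le 4 \sum_{j \in \mathcal{B}} I\{\|\alpha\|_{\mathcal{G}_j,2} \ge 2\lambda_n(2)\}\, \|\hat\alpha - \alpha\|_{\mathcal{G}_j,2}^2 = 4\|\hat\alpha - \alpha\|_{\mathcal{I} \cap \mathcal{G}_{\mathcal{B}},2}^2 \qquad (26)$$

where

$$\mathcal{I} := \{(j,t) \in \{1, \dots, k\},\ \|\alpha\|_{\mathcal{G}_j,2} \ge 2\lambda_n(2)\}.$$

Thanks to Condition (12) we get

$$n_{\mathcal{G}}(\mathcal{I}) := \#(\{j,\ \exists t,\ (j,t) \in \mathcal{I}\}) \le \sum_{j=1}^{p} I\{\|\alpha\|_{\mathcal{G}_j,2} \ge 2\lambda_n(2)\}\, \big((2\lambda_n(2))^{-1}\|\alpha\|_{\mathcal{G}_j,2}\big)^q \le (2\lambda_n(2))^{-q} M^q \nu_n^{q/2}$$

and we bound $\#(\mathcal{I})$ by $t^* \times \#(\{j,\ \exists t,\ (j,t) \in \mathcal{I}\})$. It follows that

$$r(\mathcal{I}) \le M^q \nu_n^{q/2} (2\lambda_n(2))^{-q}\, [\tau_{BG}^2 + t^*\tau_{BT}^2] \le M^q \nu_n^{q/2} (2\lambda_n(2))^{-q}\, r^*.$$

Using successively Proposition 2 and Proposition 3 we get

$$E(ISB) \le 4\kappa\, E\left(M^2\nu_n\, r(\mathcal{I}) + \|P_{V_{\mathcal{I}}}[W]\|_2^2 + r(\mathcal{I})\|P_{V_{\mathcal{G}_{\mathcal{B}}}}[W]\|_2^2\right)$$
$$\le 4\kappa \left(M^2\nu_n\, r(\mathcal{I}) + C_1 [t^* \vee \log p]\,[n_{\mathcal{G}}(\mathcal{I}) + r(\mathcal{I}) N^*]\right)$$
$$\le [4 + 2C_1]\,\kappa \left(M^q \nu_n^{q/2} (2\lambda_n(2))^{-q}\right) (\lambda^*)^2$$

where $\lambda^*$ is defined in (15) and because $r^* N^* \le \tau^* N^* \le \nu$. The bound given in (26) is valid
for IBB and then the proof also holds for IBB.
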
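7.3.2. Study of OSS, OBS and ISS

Let q be such that Condition (12) is satisfied. Since $\|\alpha\|_{\mathcal{G}_j,2}^2 = \|\alpha\|_{\mathcal{G}_j,2}^{2-q}\, \|\alpha\|_{\mathcal{G}_j,2}^{q}$ and $2 - q > 0$, we get

$$OSS \le \sum_{j=1}^{p} I\{\|\alpha\|_{\mathcal{G}_j,2} < 2\lambda_n(1)\}\, \|\alpha\|_{\mathcal{G}_j,2}^2 \le (2\lambda_n(1))^{2-q} \sum_{j=1}^{p} \|\alpha\|_{\mathcal{G}_j,2}^q \le M^q \nu_n^{q/2} (2\lambda_n(1))^{2-q}.$$

Note that this proof can also be performed for OBS and ISS since $\lambda_n(1) \ge \lambda_n(2)$.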

7.3.3. Study of OSB 
Since 

l|a||0 J)2 =(||a|| g . 2 -||R|| s . 2 ) + ||R|| 6 . 2 

we get 

OSB < 2 Y_ M«\k,2 - \\n Sj ,2 > And )} (lk|| gj , 2 - ||R|| gj , 2 ) 2 

+ 2^1{||a|| ej , 2 - ||R|| 0j)2 > A U (1)}I{||R|| S . 2 < A n (1)} ||R|| 2 g j)2 , 

and by Cauchy-Schwarz 

r ,-i 1/2 

E(OSB) < lY_ [P ( | \\oc\\g j>2 - ||R|| ej ,2| > A n (D) E (||a|| s . 2 - ||R|| gj , 2 ) 4 
j<p 

+ 2A n (l ) 2 ^P(| ||a|| e . 2 - ||R|| e . 2 | > And )) • 

j<p 

On the one hand, as an immediate consequence of Propositions [TJ and [31 we get 

E ( ll*lk,2 - NM* < 32M 4 v 2 x(£ j ) 2 + 16cr 4 (1 + v) 2 #{Gi) 2 . 

Since #({?j) < t* and t(^) < t*, we bound this term by 32(A*) 4 . On the other hand, usinj 
Proposition [31 

P( I I|RM-Il«||^| > A) < p(||P Vs W|| 2 > A/2(l +y) ]/1 

< exp (-A 2 /32(1 +v)) 

as soon as 

A 2 > (8M 2 v n r(^)) V(16(l +y) [t* Vlogp]). 
This condition is verified by $\lambda_n(1)$ as soon as

$$\lambda_n(1) \ge 4\lambda^* \qquad (27)$$

and it follows

9 /A ("P 2 

E(OSB) < -pA n (1) 2 exp 



4 r IU ' r \ 64(1 + y 

7.3.4. Study o/OBB 

Let us decompose again this term into 2 different ones, 

OBB = Y_ ll a ll|,2 + X. H a ll^>2 := 0BBl + 0BB2 

jec, jec 2 

where 

C ={j e £ c , ||a|| g . 2 > 2A n (1), ||R|| g . 2 > A n (l)} 

and 

d =Cn{j,||R|| g . 2 < ||a|| g . 2 /2} and C 2 = C n {j, ||R||g. 2 > ||<x|| g . 2 /2} 
On the one hand, we obviously have 

d C{j G{1,...,p}, A n (1)< ||a|| e . 2 -||R||g. 2 } 

leading to 

p 

OBB, < Xl{| Mk,2 - ||R|| S) , 2 | > A n (1)} ||a|| 2 . 2 . 
j=i 

We conclude as for the term OSB. For OBBi the argument is slightly more subtle : on the 
other hand, 

j £ B and ||R||s,,2 > A n (1) => p 2 j} < p 2 N », 

(see Step 2 of the procedure) inducing that there exist at least N* leader indices )' 7^ j in 
{1 , . . . , p} such that ||R||g.,,2 > ||R||sj,2- Assume now that the following inequality is true (this 
will be proved later) : 

#(C) < N*. (28) 
This implies that there exists at least one index (depending on j) called such that 
||a||g r(j)) 2 < A u (1)/2 (because )*{)) g C) and ||R||g r(J) ,2 > 1^11^,2- 
We deduce that, for this index, we have

\ms iHj) ,2-\Mg^ h 2 > ||R|| ejl2 -A n (l)/2 



> ll^llGj,2/2 (because j G C) 

> WotWg.^/l — A n (1)/2 (because j G C{\ 

> ll a llGj,2/4 (because ) EC). 



It follows that 

p 



OBB 2 <4^I{|||a|| gr(j) , 2 -||R||^ m , 2 | >A n (1)/2} ( \\oi\\ g .^ 2 - ||R|| e ., 0))2 ) 

and we conclude as for the term OSB. It remains now to prove (I25j) : thanks to Condition 

we get 



#(C)<#({jG{1,...,p}, ||cx|| e . 2 >2A u (l)}) 

<£l{j GC}, (2A n (ir 1 ||<x||g j , 2 ) q < (2A n (1)- 1 ) q M^f 

H 

and fl28|) is satisfied as soon as A n (l) > 2M.VJ/ 2 (N*)~ 1 / q which is verified for any q < 1 as 
soon as 

A n (l)>2Mvy 2 (N*)- 1 . (29) 

7.3.5. Study of IBS 

The triangular inequality for the norm ||.||g.,2 leads to 

IBS < ^I{||&- ot\\g u 2 > A n (2)/2} || a - a|| 2 . 2 . 
Using Cauchy Schwarz inequality we get 

_F_ / N 1/2 1/7 

E(IBS) < Y_ (E||a- a|| 4 g . 2 I{j G £}) P ( ||6fc- a|| e . 2 I{j G i3} > A n (2)/2) 1/2 . 
j=i 

On the one hand, by Propositions [2] and El we get 



E a|| 4 . 2 I{j G B}\ < 3k 2 (M 4 v 2 r(^) 2 + 2C 2 a 4 [1 +r(^) 2 [N*] 2 ] [t* Vlogp] 2 ) . 
Since $r(\mathcal{G}_j) \le r^*$, $\#(\mathcal{G}_j) \le t^*$ and $\#(\mathcal{G}_{\mathcal{B}}) \le N^* t^*$, we bound this term by $64\kappa^2 (\lambda^*)^4$
(because $N^*\tau^* \le \nu$). On the other hand, using again Propositions 2 and 3 we have
P(||<x-&|| 9 „2l{jeB} > A ) < exp ( 



S +exp ("2*i)) 



as soon as 



It follows that, if 



A 2 > 3k (M 2 v n r(^) V 16(1 + r(^)N*)[t* V logp]) 



An(2) > 5^A* 



we get 



E(IBS) < 3k P (A 



*\2 



exp 



192k J 



exp 



A 2 (2) 
192kt* 



7.3.6. End of the proof 

If we summarize the results obtained above, choosing 

A n (1) = Cl A* VpMv^lNT 1 ] and A n (2) = c 2 A* 

with Ci > C2, Ci > 5\/k and Ci > (4 + v~ 1/q ), we obtain 

E||&*-a|| 2 <2x [4 K (M q v q/2 (2A n (2)r q ) (A*) 2 ]+3x [M q v q/2 (2A n (1 )) 2 - q ] 



9 . fn2 / A n (1) 2 

4 pM1) ex H"64TTTv 



3k p(A 



*\2 



exp 



A 2 (2) 
64 k 



+ exp 



A 2 (2) 
192kt* 



< cv n ((AT"< /2 - 1 + (N*) q ~ 2 ) 



under the condition 



c Q p exp(-c b (A*)(1 Afr*)- 1 )) <v^ 2 (A*) _q 



where 



c Q = M q ( -c] V 3k ] and c b 



A ^ 



64(1 +v) 192k* 



Replacing A*, we obtain the announced result. 
8. Appendix

Recall that $X_{\mathcal{I}}$ denotes the matrix restricted to the columns of X whose indices are in $\mathcal{I}$,
a subset of $\{1, \dots, k\}$, and that $\Gamma_{\mathcal{I}} = X_{\mathcal{I}}^t X_{\mathcal{I}}$. Denote by $P_{V_{\mathcal{I}}}$ the projection on the space spanned by
the predictors whose index $\ell$ belongs to $\mathcal{I}$ :

$$P_{V_{\mathcal{I}}} = X_{\mathcal{I}} (X_{\mathcal{I}}^t X_{\mathcal{I}})^{-1} X_{\mathcal{I}}^t = X_{\mathcal{I}} (\Gamma_{\mathcal{I}})^{-1} X_{\mathcal{I}}^t.$$

Recall that any index $\ell$ of $\{1, \dots, k\}$ can be registered as a pair (j, t) where j is the index of
the group $\mathcal{G}_j$ to which $\ell$ belongs and t is the rank of $\ell$ inside $\mathcal{G}_j$.

8.1. Proof of Lemma 3

We use the definition (5) of $\tau_{BT}$ and the definition of $\tau_{BG}$ :

|B(C)||| = ^B(C)f = Y_ ( 21 r ^'' 

lex lex \i'ec,l'& 

- Y- yBT Y- i a (j',t')i +tbg 21 i a o',t) 

(J,t)ez \ 0',t')ec,tvt J'=W.(j'.t)ecjvJ 

2 



< 2y 2 BT 2j a «l +2y BG L 

(j,t)6J Vcec / j=i,...,p,(j,t)ex 



21 21 l^'.t) 

t=i,...,tj,(j,t)ei \j'=i,..., P ,(j',t)GC,jVj 



< 2y 2 BT #(X) l^l) + V BG Mb (j,t) e X}) W 

Vfec / Veec 



which ends the proof. 

8.2. Proof of Lemma

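Let us decompose the sum

$$x^t\Gamma(\mathcal I)x = \sum_{\ell,\ell'=1}^{m} x_\ell x_{\ell'}\Gamma_{\ell\ell'} = \sum_{\ell=1}^{m} x_\ell^2\,\Gamma_{\ell\ell} + \sum_{\ell\neq\ell'} x_\ell x_{\ell'}\Gamma_{\ell\ell'}.$$

Using Condition (10), it follows that

$$x^t\Gamma(\mathcal I)x - \|x\|^2_{\ell_2(m)} = \sum_{\ell\neq\ell'} x_\ell x_{\ell'}\Gamma_{\ell\ell'}.$$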
To handle the fact that the sizes $t_j$ of the groups $\mathcal G_j$ may differ, we let $t$ range up to $t^* = \max(t_1,\dots,t_p)$, with the convention that $x_{(j,t)} = 0$ if the index $(j,t)\notin\mathcal I$. Using the definitions of $\gamma_{BT}$ and $\gamma_{BG}$, we get

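$$\big|x^t\Gamma(\mathcal I)x - \|x\|^2_{\ell_2(m)}\big| \le \gamma_{BT}\sum_{(j,t)\in\mathcal I,\ (j',t')\in\mathcal I,\ t\neq t'}|x_{(j,t)}\,x_{(j',t')}| + \gamma_{BG}\sum_{(j,t)\in\mathcal I,\ (j',t')\in\mathcal I,\ t=t'}|x_{(j,t)}\,x_{(j',t)}|$$

$$\le \gamma_{BT}\Big(\sum_{(j,t)\in\mathcal I}|x_{(j,t)}|\Big)^2 + \gamma_{BG}\sum_{t=0}^{t^*}\Big(\sum_{j\in\{j,\ (j,t)\in\mathcal I\}}|x_{(j,t)}|\Big)^2$$

$$\le \gamma_{BT}\,\#(\mathcal I)\sum_{(j,t)\in\mathcal I}|x_{(j,t)}|^2 + \gamma_{BG}\sum_{t=0}^{t^*}\#(\{j,\ (j,t)\in\mathcal I\})\sum_{j\in\{j,\ (j,t)\in\mathcal I\}}|x_{(j,t)}|^2$$

$$\le \tau(\mathcal I)\,\|x\|^2_{\ell_2(m)},$$

which ends the proof since $\tau(\mathcal I)\le\nu$.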
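The chain of inequalities above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes unit-norm columns (so that the Gram matrix has unit diagonal), takes $\tau(\mathcal I) = \gamma_{BT}\#(\mathcal I) + \gamma_{BG}\max_t \#\{j, (j,t)\in\mathcal I\}$ as read off the last two lines of the proof, and uses hypothetical helper names `coherence_indices` and `tau`.

```python
import numpy as np

def coherence_indices(gram, groups, ranks):
    """Return (gamma_BT, gamma_BG) for a Gram matrix with unit diagonal.

    Index l is encoded as the pair (groups[l], ranks[l]) = (j, t).
    gamma_BT: largest |gram| entry over pairs with different ranks.
    gamma_BG: largest |gram| entry over pairs with equal rank, different group.
    """
    same_rank = ranks[:, None] == ranks[None, :]
    same_group = groups[:, None] == groups[None, :]
    abs_gram = np.abs(gram)
    gamma_bt = abs_gram[~same_rank].max(initial=0.0)
    gamma_bg = abs_gram[same_rank & ~same_group].max(initial=0.0)
    return gamma_bt, gamma_bg

def tau(gamma_bt, gamma_bg, groups, ranks):
    """Coherence index: gamma_BT * #(I) + gamma_BG * max_t #{j : (j, t) in I}."""
    _, counts = np.unique(ranks, return_counts=True)
    return gamma_bt * len(groups) + gamma_bg * counts.max()

rng = np.random.default_rng(0)
n = 50
groups = np.array([0, 0, 0, 1, 1, 1])   # group index j of each column
ranks = np.array([0, 1, 2, 0, 1, 2])    # rank t of each column inside its group

X_I = rng.standard_normal((n, groups.size))
X_I /= np.linalg.norm(X_I, axis=0)      # assumed normalization: unit-norm columns
gram = X_I.T @ X_I

gamma_bt, gamma_bg = coherence_indices(gram, groups, ranks)
x = rng.standard_normal(groups.size)
lhs = abs(x @ gram @ x - x @ x)
bound = tau(gamma_bt, gamma_bg, groups, ranks) * (x @ x)
print(lhs <= bound)
```

The inequality is deterministic, so the check holds for any vector `x` as long as the pairs (group, rank) are distinct and the columns are normalized.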

8.3. Proof of Lemma
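Since

$$\|P_{V_{\mathcal I}}x\|_2^2 = (X_{\mathcal I}^t x)^t\,(\Gamma_{\mathcal I})^{-1}\,(X_{\mathcal I}^t x),$$

we have

$$(1+\nu)^{-1}\|X_{\mathcal I}^t x\|_2^2 \le \|P_{V_{\mathcal I}}x\|_2^2 \le (1-\nu)^{-1}\|X_{\mathcal I}^t x\|_2^2$$

applying the RIP Property (21). Observing that

$$\|X_{\mathcal I}^t x\|_2^2 = (X_{\mathcal I}^t x)^t(X_{\mathcal I}^t x) = \sum_{\ell\in\mathcal I}\Big(\sum_{i=1}^n x_i X_{i\ell}\Big)^2,$$

we obtain the announced result.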
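As an illustration of this sandwich bound, here is a small numerical sketch. It is not taken from the paper: it assumes unit-norm columns and uses, in place of the RIP constant, the smallest $\nu$ for which the eigenvalues of $\Gamma_{\mathcal I}$ lie in $[1-\nu, 1+\nu]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5

X_I = rng.standard_normal((n, m))
X_I /= np.linalg.norm(X_I, axis=0)        # assumed normalization: unit-norm columns
gram = X_I.T @ X_I                        # Gamma_I

# Smallest nu such that (1 - nu)|u|^2 <= u' Gamma_I u <= (1 + nu)|u|^2 for all u.
eigenvalues = np.linalg.eigvalsh(gram)    # sorted in ascending order
nu = max(eigenvalues[-1] - 1.0, 1.0 - eigenvalues[0])

x = rng.standard_normal(n)
corr = X_I.T @ x                          # X_I^t x
proj_sq = corr @ np.linalg.solve(gram, corr)   # ||P_V x||^2 = corr' Gamma_I^{-1} corr
corr_sq = corr @ corr                     # ||X_I^t x||^2

lower = corr_sq / (1.0 + nu)
upper = corr_sq / (1.0 - nu)
print(lower <= proj_sq <= upper)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the Gram matrix explicitly.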

8.4. Proof of Proposition

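Since the model under consideration is $Y = X\alpha + W$, we have for any $\ell$ in $\{1,\dots,k\}$

$$R_\ell = \sum_{i=1}^n Y_i X_{i\ell} = \sum_{i=1}^n (X\alpha)_i X_{i\ell} + \sum_{i=1}^n W_i X_{i\ell},$$

leading to

$$R_\ell - \alpha_\ell = \sum_{\ell'=1,\dots,k,\ \ell'\neq\ell}\Gamma_{\ell\ell'}\,\alpha_{\ell'} + X_{\cdot\ell}^t W := B_\ell + V_\ell$$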
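thanks to Condition (10). It follows that

$$\big|\,\|R\|_{\mathcal G_j,2} - \|\alpha\|_{\mathcal G_j,2}\big| \le \|R-\alpha\|_{\mathcal G_j,2} \le \|B\|_{\mathcal G_j,2} + \|V\|_{\mathcal G_j,2}.$$

Applying Lemma [3] with $\mathcal C = \{1,\dots,k\}$ and $\mathcal I = \mathcal G_j$, we obtain $\|B\|^2_{\mathcal G_j,2} \le 2\,\|\alpha\|_1^2\, r(\mathcal G_j)$. Since

$$\|\alpha\|_1 = \sum_{j=1}^{p}\sum_{t}|\alpha_{(j,t)}| \le \Big(\sum_{j=1}^{p}\Big(\sum_{t}|\alpha_{(j,t)}|\Big)^q\Big)^{1/q},$$

we get

$$\|B\|^2_{\mathcal G_j,2} \le 2M^2\nu_n\, r(\mathcal G_j).$$

Second, using Property (21), which holds because we assumed (11), we get

$$\|V\|^2_{\mathcal G_j,2} = \|X_{\mathcal G_j}^t W\|_2^2 \le (1+\nu)\,(X_{\mathcal G_j}^t W)^t\,\Gamma_{\mathcal G_j}^{-1}\,(X_{\mathcal G_j}^t W) = (1+\nu)\,\|P_{V_{\mathcal G_j}}W\|_2^2,$$

which ends the proof.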
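The decomposition $R_\ell - \alpha_\ell = B_\ell + V_\ell$ used in this proof is an exact algebraic identity once the columns of $X$ are normalized, and can be verified on simulated data. The sketch below is illustrative only: the dimensions, the sparse vector `alpha` and the noise level are arbitrary choices, and unit-norm columns are an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 8

X = rng.standard_normal((n, k))
X /= np.linalg.norm(X, axis=0)            # assumed normalization: unit-norm columns

alpha = np.zeros(k)
alpha[[1, 4]] = [2.0, -1.5]               # arbitrary sparse parameter
W = 0.1 * rng.standard_normal(n)          # arbitrary noise level
Y = X @ alpha + W

gram = X.T @ X
R = X.T @ Y                               # R_l = sum_i Y_i X_il

# B_l: off-diagonal Gram entries acting on alpha; V_l: correlation with the noise.
off_diagonal = gram - np.diag(np.diag(gram))
B = off_diagonal @ alpha
V = X.T @ W

identity_holds = np.allclose(R - alpha, B + V)
print(identity_holds)
```

The bias term `B` vanishes when the regressors are orthogonal, which is why small coherence keeps $R$ close to $\alpha$.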

8.5. Proof of Proposition

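Recall the definitions (18) and (19) and let us put

$$\bar\alpha(\mathcal I) = \alpha_{\mathcal I} + (X_{\mathcal I}^t X_{\mathcal I})^{-1}X_{\mathcal I}^t X_{\mathcal I^c}\,\alpha_{\mathcal I^c}$$

such that

$$\bar\alpha(\mathcal I) - \alpha_{\mathcal I} = (\Gamma_{\mathcal I})^{-1}X_{\mathcal I}^t X_{\mathcal I^c}\,\alpha_{\mathcal I^c}.$$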
Since X C ^g, we have cfy = for any IgI and 

II a - a||| )2 = || ax - a(i3)x|||, 2 

< || a x - a(X) ||| )2 + || a(X) - a(X) ||| ;2 + 
:=ti(X)+t 2 (X)+t 3 . 



a(X)-a(£) 112 



(31) 
Using the RIP Property twice and applying Lemma [3] with $\mathcal I := \mathcal I$ and $\mathcal C := \mathcal I^c$, we bound the first term

ti (X) < -—— (oc(X) - (x x f T x (a(X) - cx z ) 
1 — v 

ctjcXjcXj) (Fx) 1 (XjXjcaxc) 



1 - v 



- M _ y] 2\\ X X X I c0L I c Wi,2 



2 ^1 ( ^1 ^ ( 



1 — "V 

J «ei Vex 



- fT^ l|a|l ^ ir(X) - 



Recall that in the specific case where $\mathcal I = \mathcal G_B$, we get $\tau(\mathcal G_B)\le\nu$ by construction of the leader groups (see (14)), so that

tM B )<—^ 2 lk| 2 5 c, (32) 



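For the study of $t_2(\mathcal I)$, use Inequality (22),

$$t_2(\mathcal I) \le \frac{1}{1-\nu}\,\|X_{\mathcal I}\bar\alpha(\mathcal I) - X_{\mathcal I}\hat\alpha(\mathcal I)\|_2^2,$$

and observe that

$$X_{\mathcal I}\hat\alpha(\mathcal I) = P_{V_{\mathcal I}}[X\alpha + W] = X_{\mathcal I}\bar\alpha(\mathcal I) + P_{V_{\mathcal I}}W$$

to obtain the bound

$$t_2(\mathcal I) \le \frac{1}{1-\nu}\,\|P_{V_{\mathcal I}}W\|_2^2. \qquad (33)$$

Finally, use Inequality (22) again,

$$t_3 \le \frac{1}{1-\nu}\,\|X_{\mathcal I}\hat\alpha(\mathcal I) - X_{\mathcal I}\hat\alpha(B)_{\mathcal I}\|_2^2,$$

and observe that

$$X_{\mathcal I}\hat\alpha(\mathcal I) - X_{\mathcal I}\hat\alpha(B)_{\mathcal I} = P_{V_{\mathcal I}}\big[X_{\mathcal I}\hat\alpha(\mathcal I) - X_{\mathcal I}\hat\alpha(B)_{\mathcal I}\big]$$

$$= P_{V_{\mathcal I}}\big[X_{\mathcal I}\hat\alpha(\mathcal I) - X_{\mathcal G_B}\hat\alpha(B) + X_{\mathcal G_B/\mathcal I}\,\hat\alpha(B)_{\mathcal G_B/\mathcal I}\big]$$

$$= P_{V_{\mathcal I}}\big[P_{V_{\mathcal I}}[X\alpha + W] - P_{V_{\mathcal G_B}}[X\alpha + W] + X_{\mathcal G_B/\mathcal I}\,\hat\alpha(B)_{\mathcal G_B/\mathcal I}\big]$$

$$= P_{V_{\mathcal I}}\big[X_{\mathcal G_B/\mathcal I}\,\hat\alpha(B)_{\mathcal G_B/\mathcal I}\big],$$

since $\mathcal I\subset\mathcal G_B$. Applying Lemma [2] and Lemma [3] with $\mathcal I := \mathcal I$ and $\mathcal C := \mathcal G_B/\mathcal I$, we get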

1 



t 3 < 



< 



< 



1 - v 
1 



:i v] 



|P Vz [Xg B /za(B)g B/ x]||; 



2 



cex \ i=i \i'eg B /i 



r(X) ||a(£)|| 2 B)1 . 



Writing 



|cx(B)|| 2 eJ < 3 ( ||a(B) - <x(£0|| 2 Bil + ||a(B) - a ||J 8>1 + ||a|| 2 8il ) 



we deduce that 



t 3 < 



1 



r(x) (t 2 (g B )+t ] (g B ) + \\oc\\ 2 g ^) 



and combining with fl32|) and (13*3]) . we obtain 



t 3 < 



r(X) 



2v 



1 



1-v] 



58,1 



1 



|Pv e8 W|| 2 +||(x|| 2 e)1 



:i-v) 

This ends the proof of the proposition. 

8.6. Proof of Proposition [3]

First, the proof in the case where $\mathcal I$ is not random is standard and can be found for instance in Mougeot et al. (2012). Assume now that $\mathcal I$ is random. We take into account all the nonrandom possibilities $\mathcal I'\subset\mathcal H$ for the set $\mathcal I$ and apply Proposition [3] in the nonrandom case. As the cardinality of $\mathcal H$ is less than $p^L$ by the limitations imposed on $\mathcal I$, we get

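$$P\big(\|P_{V_{\mathcal I}}[W]\|^2 \ge z^2\big) \le \sum_{\mathcal I'\subset\mathcal H} P\big(\|P_{V_{\mathcal I'}}[W]\|^2 \ge z^2\big)$$

$$\le p^L\exp(-z^2/8)$$

$$\le \exp\Big(-\Big(\frac{z^2}{8} - L\log p\Big)\Big)$$

$$\le \exp(-z^2/16)$$

as soon as $z^2 \ge 4\,(\sup\#\{\mathcal I',\ \mathcal I'\subset\mathcal H\})$ and $z^2 \ge 16\,L\log p$. To complete the proof, it remains to observe that $\sup\#\{\mathcal I',\ \mathcal I'\subset\mathcal H\}\le Lt^*$.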
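Both ingredients of this argument can be illustrated numerically. The sketch below is a rough check rather than part of the proof: it assumes standard Gaussian noise (unit variance), a fixed subspace of dimension `d`, and arbitrary values of `p` and `L`. It estimates the tail of $\|P_V W\|^2$ by Monte Carlo at the threshold $z^2 = 4d$, then evaluates the union bound at $z^2 = 16L\log p$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 5

# Orthonormal basis of a random d-dimensional subspace V of R^n.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))

# Nonrandom case: tail of ||P_V W||^2 at the smallest allowed threshold z^2 = 4 d.
z2 = 4.0 * d
W = rng.standard_normal((20000, n))       # assumed standard Gaussian noise
proj_sq = ((W @ Q) ** 2).sum(axis=1)      # ||P_V W||^2 for each draw
empirical = (proj_sq >= z2).mean()
single_bound = np.exp(-z2 / 8.0)

# Random case: union bound over at most p^L candidate sets, at z^2 = 16 L log p.
p, L = 50, 2                              # arbitrary illustrative values
z2_union = 16.0 * L * np.log(p)
union_bound = p**L * np.exp(-z2_union / 8.0)
target = np.exp(-z2_union / 16.0)

print(empirical, single_bound)
print(union_bound, target)
```

At the boundary value $z^2 = 16L\log p$ the two quantities `union_bound` and `target` coincide (both equal $p^{-L}$), which shows that the condition on $z^2$ is exactly what the last inequality of the chain requires.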
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